Computer Vision and Pattern Recognition 150
☆ Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark CVPR 2024
Ziyang Chen, Israel D. Gebru, Christian Richardt, Anurag Kumar, William Laney, Andrew Owens, Alexander Richard
We present a new dataset called Real Acoustic Fields (RAF) that captures real
acoustic room data from multiple modalities. The dataset includes high-quality
and densely captured room impulse response data paired with multi-view images,
and precise 6DoF pose tracking data for sound emitters and listeners in the
rooms. We used this dataset to evaluate existing methods for novel-view
acoustic synthesis and impulse response generation which previously relied on
synthetic data. In our evaluation, we thoroughly assessed existing audio and
audio-visual models against multiple criteria and proposed settings to enhance
their performance on real-world data. We also conducted experiments to
investigate the impact of incorporating visual data (i.e., images and depth)
into neural acoustic field models. Additionally, we demonstrated the
effectiveness of a simple sim2real approach, where a model is pre-trained with
simulated data and fine-tuned with sparse real-world data, resulting in
significant improvements in the few-shot learning approach. RAF is the first
dataset to provide densely captured room acoustic data, making it an ideal
resource for researchers working on audio and audio-visual neural acoustic
field modeling techniques. Demos and datasets are available on our project
page: https://facebookresearch.github.io/real-acoustic-fields/
comment: Accepted to CVPR 2024. Project site:
https://facebookresearch.github.io/real-acoustic-fields/
☆ MetaCap: Meta-learning Priors from Multi-View Imagery for Sparse-view Human Performance Capture and Rendering
Faithful human performance capture and free-view rendering from sparse RGB
observations is a long-standing problem in Vision and Graphics. The main
challenges are the lack of observations and the inherent ambiguities of the
setting, e.g. occlusions and depth ambiguity. As a result, radiance fields,
which have shown great promise in capturing high-frequency appearance and
geometry details in dense setups, perform poorly when naively supervising
them on sparse camera views, as the field simply overfits to the sparse-view
inputs. To address this, we propose MetaCap, a method for efficient and
high-quality geometry recovery and novel view synthesis given very sparse or
even a single view of the human. Our key idea is to meta-learn the radiance
field weights solely from potentially sparse multi-view videos, which can serve
as a prior when fine-tuning them on sparse imagery depicting the human. This
prior provides a good network weight initialization, thereby effectively
addressing ambiguities in sparse-view capture. Due to the articulated structure
of the human body and motion-induced surface deformations, learning such a
prior is non-trivial. Therefore, we propose to meta-learn the field weights in
a pose-canonicalized space, which reduces the spatial feature range and makes
feature learning more effective. Consequently, one can fine-tune our field
parameters to quickly generalize to unseen poses, novel illumination conditions
as well as novel and sparse (even monocular) camera views. For evaluating our
method under different scenarios, we collect a new dataset, WildDynaCap, which
contains subjects captured in both a dense camera dome and in-the-wild sparse
camera rigs, and demonstrate superior results compared to recent
state-of-the-art methods on both public datasets and WildDynaCap.
comment: Project page: https://vcai.mpi-inf.mpg.de/projects/MetaCap/
☆ Benchmarking Object Detectors with COCO: A New Path Forward
The Common Objects in Context (COCO) dataset has been instrumental in
benchmarking object detectors over the past decade. Like every dataset, COCO
contains subtle errors and imperfections stemming from its annotation
procedure. With the advent of high-performing models, we ask whether these
errors of COCO are hindering its utility in reliably benchmarking further
progress. In search of an answer, we inspect thousands of masks from COCO
(2017 version) and uncover different types of errors such as imprecise mask
boundaries, non-exhaustively annotated instances, and mislabeled masks. Due to
the prevalence of COCO, we choose to correct these errors to maintain
continuity with prior research. We develop COCO-ReM (Refined Masks), a cleaner
set of annotations with visibly better mask quality than COCO-2017. We evaluate
fifty object detectors and find that models that predict visually sharper masks
score higher on COCO-ReM, affirming that they were being incorrectly penalized
due to errors in COCO-2017. Moreover, our models trained using COCO-ReM
converge faster and score higher than their larger variants trained using
COCO-2017, highlighting the importance of data quality in improving object
detectors. With these findings, we advocate using COCO-ReM for future object
detection research. Our dataset is available at https://cocorem.xyz
comment: Technical report. Dataset website: https://cocorem.xyz and code:
https://github.com/kdexd/coco-rem
☆ ObjectDrop: Bootstrapping Counterfactuals for Photorealistic Object Removal and Insertion
Diffusion models have revolutionized image editing but often generate images
that violate physical laws, particularly the effects of objects on the scene,
e.g., occlusions, shadows, and reflections. By analyzing the limitations of
self-supervised approaches, we propose a practical solution centered on a
"counterfactual" dataset. Our method involves capturing a scene before and
after removing a single object, while minimizing other changes. By fine-tuning
a diffusion model on this dataset, we are able to not only remove objects but
also their effects on the scene. However, we find that applying this approach
for photorealistic object insertion requires an impractically large dataset. To
tackle this challenge, we propose bootstrap supervision; leveraging our object
removal model trained on a small counterfactual dataset, we synthetically
expand this dataset considerably. Our approach significantly outperforms prior
methods in photorealistic object removal and insertion, particularly at
modeling the effects of objects on the scene.
☆ Garment3DGen: 3D Garment Stylization and Texture Generation
We introduce Garment3DGen, a new method to synthesize 3D garment assets from a
base mesh given a single input image as guidance. Our proposed approach allows
users to generate 3D textured clothes based on both real and synthetic images,
such as those generated by text prompts. The generated assets can be directly
draped and simulated on human bodies. First, we leverage the recent progress of
image to 3D diffusion methods to generate 3D garment geometries. However, since
these geometries cannot be utilized directly for downstream tasks, we propose
to use them as pseudo ground-truth and set up a mesh deformation optimization
procedure that deforms a base template mesh to match the generated 3D target.
Second, we introduce carefully designed losses that allow the input base mesh
to freely deform towards the desired target, yet preserve mesh quality and
topology such that they can be simulated. Finally, a texture estimation module
generates high-fidelity texture maps that are globally and locally consistent
and faithfully capture the input guidance, allowing us to render the generated
3D assets. With Garment3DGen, users can generate the textured 3D garment of
their choice without the need for artist intervention. One can provide a textual
prompt describing the garment they desire to generate a simulation-ready 3D
asset. We present a plethora of quantitative and qualitative comparisons on
various assets both real and generated and provide use-cases of how one can
generate simulation-ready 3D garments.
comment: Project Page: https://nsarafianos.github.io/garment3dgen
☆ Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia
In this work, we introduce Mini-Gemini, a simple and effective framework
enhancing multi-modality Vision Language Models (VLMs). Despite the
advancements in VLMs facilitating basic visual dialog and reasoning, a
performance gap persists compared to advanced models like GPT-4 and Gemini. We
try to narrow the gap by mining the potential of VLMs for better performance
and any-to-any workflow from three aspects, i.e., high-resolution visual
tokens, high-quality data, and VLM-guided generation. To enhance visual tokens,
we propose to utilize an additional visual encoder for high-resolution
refinement without increasing the visual token count. We further construct a
high-quality dataset that promotes precise image comprehension and
reasoning-based generation, expanding the operational scope of current VLMs. In
general, Mini-Gemini further mines the potential of VLMs and empowers current
frameworks with image understanding, reasoning, and generation simultaneously.
Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs)
from 2B to 34B. It is demonstrated to achieve leading performance in several
zero-shot benchmarks and even surpasses the developed private models. Code and
models are available at https://github.com/dvlab-research/MiniGemini.
comment: Code and models are available at
https://github.com/dvlab-research/MiniGemini
☆ Duolando: Follower GPT with Off-Policy Reinforcement Learning for Dance Accompaniment ICLR 2024
We introduce a novel task within the field of 3D dance generation, termed
dance accompaniment, which necessitates the generation of responsive movements
from a dance partner, the "follower", synchronized with the lead dancer's
movements and the underlying musical rhythm. Unlike existing solo or group
dance generation tasks, a duet dance scenario entails a heightened degree of
interaction between the two participants, requiring delicate coordination in
both pose and position. To support this task, we first build a large-scale and
diverse duet interactive dance dataset, DD100, by recording about 117 minutes
of professional dancers' performances. To address the challenges inherent in
this task, we propose a GPT-based model, Duolando, which autoregressively
predicts the subsequent tokenized motion conditioned on the coordinated
information of the music, the leader's and the follower's movements. To further
enhance the GPT's capabilities of generating stable results on unseen
conditions (music and leader motions), we devise an off-policy reinforcement
learning strategy that allows the model to explore viable trajectories from
out-of-distribution samplings, guided by human-defined rewards. Based on the
collected dataset and proposed method, we establish a benchmark with several
carefully designed metrics.
comment: ICLR 2024
☆ ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation CVPR
In the absence of parallax cues, a learning-based single image depth
estimation (SIDE) model relies heavily on shading and contextual cues in the
image. While this simplicity is attractive, it is necessary to train such
models on large and varied datasets, which are difficult to capture. It has
been shown that using embeddings from pre-trained foundational models, such as
CLIP, improves zero shot transfer in several applications. Taking inspiration
from this, in our paper we explore the use of global image priors generated
from a pre-trained ViT model to provide more detailed contextual information.
We argue that the embedding vector from a ViT model, pre-trained on a large
dataset, captures greater relevant information for SIDE than the usual route of
generating pseudo image captions, followed by CLIP based text embeddings. Based
on this idea, we propose a new SIDE model using a diffusion backbone which is
conditioned on ViT embeddings. Our proposed design establishes a new
state-of-the-art (SOTA) for SIDE on the NYUv2 dataset, achieving an Abs Rel error of
0.059 (14% improvement) compared to 0.069 by the current SOTA (VPD), and on
the KITTI dataset, achieving a Sq Rel error of 0.139 (2% improvement) compared to
0.142 by the current SOTA (GEDepth). For zero-shot transfer with a model
trained on NYUv2, we report mean relative improvement of (20%, 23%, 81%, 25%)
over NeWCRFs on (Sun-RGBD, iBims1, DIODE, HyperSim) datasets, compared to (16%,
18%, 45%, 9%) by ZoeDepth. The code is available at
https://github.com/Aradhye2002/EcoDepth.
comment: Accepted at IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR) 2024
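The relative improvements quoted in the ECoDepth abstract can be checked with simple arithmetic. The sketch below also spells out Abs Rel and Sq Rel as they are commonly defined in depth-estimation benchmarks. The function names and the toy depth values are illustrative assumptions and are not taken from the paper's code.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over the given pixels (gt must be > 0)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def sq_rel(pred, gt):
    """Mean of (pred - gt)^2 / gt over the given pixels (gt must be > 0)."""
    return sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / len(gt)


def relative_improvement(baseline, ours):
    """Fractional error reduction of `ours` relative to `baseline`."""
    return (baseline - ours) / baseline


# Error values as reported in the abstract.
nyu_gain = relative_improvement(0.069, 0.059)    # Abs Rel, VPD vs. proposed
kitti_gain = relative_improvement(0.142, 0.139)  # Sq Rel, GEDepth vs. proposed

# Toy depth maps (metres), purely hypothetical.
toy_pred = [1.1, 1.8]
toy_gt = [1.0, 2.0]

print(f"NYUv2 Abs Rel improvement: {nyu_gain:.1%}")
print(f"KITTI Sq Rel improvement:  {kitti_gain:.1%}")
print(f"toy Abs Rel = {abs_rel(toy_pred, toy_gt):.3f}, "
      f"toy Sq Rel = {sq_rel(toy_pred, toy_gt):.3f}")
```

Both ratios land close to the rounded 14% and 2% figures given in the abstract.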
☆ Gamba: Marry Gaussian Splatting with Mamba for single view 3D reconstruction
We tackle the challenge of efficiently reconstructing a 3D asset from a
single image with growing demands for automated 3D content creation pipelines.
Previous methods primarily rely on Score Distillation Sampling (SDS) and Neural
Radiance Fields (NeRF). Despite their significant success, these approaches
encounter practical limitations due to lengthy optimization and considerable
memory usage. In this report, we introduce Gamba, an end-to-end amortized 3D
reconstruction model from single-view images, emphasizing two main insights:
(1) 3D representation: leveraging a large number of 3D Gaussians for an
efficient 3D Gaussian splatting process; (2) Backbone design: introducing a
Mamba-based sequential network that facilitates context-dependent reasoning and
linear scalability with the sequence (token) length, accommodating a
substantial number of Gaussians. Gamba incorporates significant advancements in
data preprocessing, regularization design, and training methodologies. We
assessed Gamba against existing optimization-based and feed-forward 3D
generation approaches using the real-world scanned OmniObject3D dataset. Here,
Gamba demonstrates competitive generation capabilities, both qualitatively and
quantitatively, while achieving remarkable speed, approximately 0.6 seconds on a
single NVIDIA A100 GPU.
☆ Object Pose Estimation via the Aggregation of Diffusion Features CVPR2024
Estimating the pose of objects from images is a crucial task of 3D scene
understanding, and recent approaches have shown promising results on very large
benchmarks. However, these methods experience a significant performance drop
when dealing with unseen objects. We believe that it results from the limited
generalizability of image features. To address this problem, we conduct an
in-depth analysis of the features of diffusion models, e.g. Stable Diffusion,
which hold substantial potential for modeling unseen objects. Based on this
analysis, we then innovatively introduce these diffusion features for object
pose estimation. To achieve this, we propose three distinct architectures that
can effectively capture and aggregate diffusion features of different
granularity, greatly improving the generalizability of object pose estimation.
Our approach outperforms the state-of-the-art methods by a considerable margin
on three popular benchmark datasets, LM, O-LM, and T-LESS. In particular, our
method achieves higher accuracy than the previous best methods on unseen objects:
98.2% vs. 93.5% on Unseen LM, 85.9% vs. 76.3% on Unseen O-LM, showing the
strong generalizability of our method. Our code is released at
https://github.com/Tianfu18/diff-feats-pose.
comment: Accepted to CVPR2024
☆ SplatFace: Gaussian Splat Face Reconstruction Leveraging an Optimizable Surface
We present SplatFace, a novel Gaussian splatting framework designed for 3D
human face reconstruction without reliance on accurate pre-determined geometry.
Our method is designed to simultaneously deliver both high-quality novel view
rendering and accurate 3D mesh reconstructions. We incorporate a generic 3D
Morphable Model (3DMM) to provide a surface geometric structure, making it
possible to reconstruct faces with a limited set of input images. We introduce
a joint optimization strategy that refines both the Gaussians and the morphable
surface through a synergistic non-rigid alignment process. A novel distance
metric, splat-to-surface, is proposed to improve alignment by considering both
the Gaussian position and covariance. The surface information is also utilized
to incorporate a world-space densification process, resulting in superior
reconstruction quality. Our experimental analysis demonstrates that the
proposed method is competitive with both other Gaussian splatting techniques in
novel view synthesis and other 3D reconstruction methods in producing 3D face
meshes with high geometric precision.
☆ ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object CVPR 2024
We establish rigorous benchmarks for visual perception robustness. Synthetic
images such as ImageNet-C, ImageNet-9, and Stylized ImageNet provide specific
types of evaluation over synthetic corruptions, backgrounds, and textures, yet
those robustness benchmarks are restricted to specified variations and have low
synthetic quality. In this work, we introduce generative models as a data source
for synthesizing hard images that benchmark deep models' robustness. Leveraging
diffusion models, we are able to generate images with more diversified
backgrounds, textures, and materials than any prior work; we term this
benchmark ImageNet-D. Experimental results show that ImageNet-D results in a
significant accuracy drop to a range of vision models, from the standard ResNet
visual classifier to the latest foundation models like CLIP and MiniGPT-4,
significantly reducing their accuracy by up to 60%. Our work suggests that
diffusion models can be an effective source to test vision models. The code and
dataset are available at https://github.com/chenshuang-zhang/imagenet_d.
comment: Accepted at CVPR 2024
☆ ModaLink: Unifying Modalities for Efficient Image-to-PointCloud Place Recognition
Weidong Xie, Lun Luo, Nanfei Ye, Yi Ren, Shaoyi Du, Minhang Wang, Jintao Xu, Rui Ai, Weihao Gu, Xieyuanli Chen
Place recognition is an important task for robots and autonomous cars to
localize themselves and close loops in pre-built maps. While single-modal
sensor-based methods have shown satisfactory performance, cross-modal place
recognition, which retrieves images from a point-cloud database, remains a
challenging problem. Current cross-modal methods transform images into 3D
points using depth estimation for modality conversion, which is usually
computationally intensive and needs expensive labeled data for depth
supervision. In this work, we introduce a fast and lightweight framework to
encode images and point clouds into place-distinctive descriptors. We propose
an effective Field of View (FoV) transformation module to convert point clouds
into a modality analogous to images. This module eliminates the necessity for
depth estimation and helps subsequent modules achieve real-time performance. We
further design a non-negative factorization-based encoder to extract mutually
consistent semantic features between point clouds and images. This encoder
yields more distinctive global descriptors for retrieval. Experimental results
on the KITTI dataset show that our proposed methods achieve state-of-the-art
performance while running in real time. Additional evaluation on the HAOMO
dataset covering a 17 km trajectory further shows the practical generalization
capabilities. We have released the implementation of our methods as open source
at: https://github.com/haomo-ai/ModaLink.git.
comment: 8 pages, 11 figures, conference
☆ Detection of subclinical atherosclerosis by image-based deep learning on chest x-ray
Guglielmo Gallone, Francesco Iodice, Alberto Presta, Davide Tore, Ovidio de Filippo, Michele Visciano, Carlo Alberto Barbano, Alessandro Serafini, Paola Gorrini, Alessandro Bruno, Walter Grosso Marra, James Hughes, Mario Iannaccone, Paolo Fonio, Attilio Fiandrotti, Alessandro Depaoli, Marco Grangetto, Gaetano Maria de Ferrari, Fabrizio D'Ascenzo
Aims. To develop a deep-learning based system for recognition of subclinical
atherosclerosis on a plain frontal chest x-ray. Methods and Results. A
deep-learning algorithm to predict coronary artery calcium (CAC) score (the
AI-CAC model) was developed on 460 chest x-rays (80% training cohort, 20%
internal validation cohort) of primary prevention patients (58.4% male, median
age 63 [51-74] years) with available paired chest x-ray and chest computed
tomography (CT) indicated for any clinical reason and performed within 3
months. The CAC score calculated on chest CT was used as ground truth. The
model was validated on a temporally independent cohort of 90 patients from the
same institution (external validation). The diagnostic accuracy of the AI-CAC
model assessed by the area under the curve (AUC) was the primary outcome.
Overall, median AI-CAC score was 35 (0-388) and 28.9% of patients had no AI-CAC.
AUC of the AI-CAC model to identify a CAC>0 was 0.90 in the internal validation
cohort and 0.77 in the external validation cohort. Sensitivity was consistently
above 92% in both cohorts. In the overall cohort (n=540), among patients with
AI-CAC=0, a single ASCVD event occurred, after 4.3 years. Patients with
AI-CAC>0 had significantly higher Kaplan-Meier estimates for ASCVD events
(13.5% vs. 3.4%, log-rank=0.013). Conclusion. The AI-CAC model seems to
accurately detect subclinical atherosclerosis on chest x-ray with elevated
sensitivity, and to predict ASCVD events with elevated negative predictive
value. Adoption of the AI-CAC model to refine CV risk stratification or as an
opportunistic screening tool requires prospective evaluation.
comment: Submitted to European Heart Journal - Cardiovascular Imaging.
Additional material is included. 44 pages (30 main paper, 14 additional
material), 14 figures (5 main manuscript, 9 additional material)
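The AI-CAC abstract reports AUC, sensitivity, and negative predictive value for detecting CAC>0. As a minimal sketch of how these quantities are computed from binary ground-truth labels and continuous model scores, the example below uses a pairwise (rank-based) AUC and a threshold at zero. The data are made up for illustration and have no relation to the study cohort; the function names are hypothetical.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_and_npv(labels, scores, threshold=0.0):
    """Flag a case as positive when its score exceeds `threshold`."""
    tp = fn = tn = 0
    for y, s in zip(labels, scores):
        flagged = s > threshold
        if y == 1 and flagged:
            tp += 1      # true positive
        elif y == 1:
            fn += 1      # missed positive
        elif not flagged:
            tn += 1      # correctly cleared negative
    return tp / (tp + fn), tn / (tn + fn)


# Hypothetical toy data: label 1 means CAC>0 on CT, score is the model output.
labels = [1, 1, 1, 0, 0, 1]
scores = [120.0, 35.0, 0.0, 0.0, 5.0, 400.0]

area = auc(labels, scores)
sens, npv = sensitivity_and_npv(labels, scores, threshold=0.0)
print(f"AUC={area:.4f} sensitivity={sens:.2f} NPV={npv:.2f}")
```

The pairwise formulation is quadratic in the number of samples, which is fine for cohorts of a few hundred patients; a sort-based rank computation would be the usual choice for larger sets.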
☆ A vascular synthetic model for improved aneurysm segmentation and detection via Deep Neural Networks
We hereby present a fully synthetic model, able to mimic the various
constituents of the cerebral vascular tree: the cerebral arteries, the
bifurcations and the intracranial aneurysms. By building this model, our goal
was to provide a substantial dataset of brain arteries which could be used by a
3D Convolutional Neural Network (CNN) to either segment or detect/recognize
various vascular diseases (such as artery dissection/thrombosis) or even some
portions of the cerebral vasculature, such as the bifurcations or aneurysms. In
this study, we will particularly focus on Intra-Cranial Aneurysm (ICA)
detection and segmentation. The cerebral aneurysms most often occur on a
particular structure of the vascular tree named the Circle of Willis. Various
studies have been conducted to detect and monitor the ICAs and those based on
Deep Learning (DL) achieve the best performances. Specifically, in this work,
we propose a fully synthetic 3D model able to mimic the brain vasculature as
acquired by Magnetic Resonance Angiography (MRA), and more particularly the
Time Of Flight (TOF) principle. Among the various MRI modalities, MRA-TOF
provides a relatively good rendering of the blood vessels and is
non-invasive (no contrast liquid injection). Our model has been designed to
simultaneously mimic the artery geometry, the ICA shape and the background
noise. The geometry of the vascular tree is modeled by interpolation
with 3D spline functions, and the statistical properties of the background MRI
noise are collected from MRA acquisitions and reproduced within the model. In
this work, we thoroughly describe the synthetic vasculature model, we build up
a neural network designed for ICA segmentation and detection, and finally, we
carry out an in-depth evaluation of the performance gain obtained through
data augmentation with the synthetic model.
☆ Enhancing Manufacturing Quality Prediction Models through the Integration of Explainability Methods
This research presents a method that utilizes explainability techniques to
amplify the performance of machine learning (ML) models in forecasting the
quality of milling processes, as demonstrated in this paper through a
manufacturing use case. The methodology entails the initial training of ML
models, followed by a fine-tuning phase where irrelevant features identified
through explainability methods are eliminated. This procedural refinement
results in performance enhancements, paving the way for potential reductions in
manufacturing costs and a better understanding of the trained ML models. This
study highlights the usefulness of explainability techniques in both explaining
and optimizing predictive models in the manufacturing realm.
☆ Towards Image Ambient Lighting Normalization
Lighting normalization is a crucial but underexplored restoration task with
broad applications. However, existing works often simplify this task within the
context of shadow removal, limiting the light sources to one and
oversimplifying the scene, thus excluding complex self-shadows and restricting
surface classes to smooth ones. Although promising, such simplifications hinder
generalizability to more realistic settings encountered in daily use. In this
paper, we propose a new challenging task termed Ambient Lighting Normalization
(ALN), which enables the study of interactions between shadows, unifying image
restoration and shadow removal in a broader context. To address the lack of
appropriate datasets for ALN, we introduce the large-scale high-resolution
dataset Ambient6K, comprising samples obtained from multiple light sources and
including self-shadows resulting from complex geometries, which is the first of
its kind. For benchmarking, we select various mainstream methods and rigorously
evaluate them on Ambient6K. Additionally, we propose IFBlend, a novel strong
baseline that maximizes Image-Frequency joint entropy to selectively restore
local areas under different lighting conditions, without relying on shadow
localization priors. Experiments show that IFBlend achieves SOTA scores on
Ambient6K and exhibits competitive performance on conventional shadow removal
benchmarks compared to shadow-specific models with mask priors. The dataset,
benchmark, and code are available at https://github.com/fvasluianu97/IFBlend.
☆ Semi-Supervised Learning for Deep Causal Generative Models
Developing models that can answer questions of the form "How would $x$ change
if $y$ had been $z$?" is fundamental for advancing medical image analysis.
Training causal generative models that address such counterfactual questions,
though, currently requires that all relevant variables have been observed and
that corresponding labels are available in training data. However, clinical
data may not have complete records for all patients and state of the art causal
generative models are unable to take full advantage of this. We thus develop,
for the first time, a semi-supervised deep causal generative model that
exploits the causal relationships between variables to maximise the use of all
available data. We explore this in the setting where each sample is either
fully labelled or fully unlabelled, as well as the more clinically realistic
case of having different labels missing for each sample. We leverage techniques
from causal inference to infer missing values and subsequently generate
realistic counterfactuals, even for samples with incomplete labels.
☆ Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
Large Vision-Language Models (LVLMs) are increasingly adept at generating
contextually detailed and coherent responses from visual inputs. However, their
application in multimodal decision-making and open-ended generation is hindered
by a notable rate of hallucinations, where generated text inaccurately
represents the visual contents. To address this issue, this paper introduces
the Instruction Contrastive Decoding (ICD) method, a novel approach designed to
reduce hallucinations during LVLM inference. Our method is inspired by our
observation that what we call disturbance instructions significantly exacerbate
hallucinations in multimodal fusion modules. ICD contrasts distributions from
standard and instruction disturbance, thereby increasing alignment uncertainty
and effectively subtracting hallucinated concepts from the original
distribution. Through comprehensive experiments on discriminative benchmarks
(POPE and MME) and a generative benchmark (LLaVa-Bench), we demonstrate that
ICD significantly mitigates both object-level and attribute-level
hallucinations. Moreover, our method not only addresses hallucinations but also
significantly enhances the general perception and recognition capabilities of
LVLMs.
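The contrast step described in the ICD abstract can be illustrated on a toy next-token distribution. The sketch below assumes the form commonly used in the contrastive decoding literature, adjusted = (1 + alpha) * standard - alpha * disturbed, applied to logits. The abstract does not state the exact formula, the value of alpha, or any additional constraints, so all of these, along with the vocabulary and logit values, are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(standard, disturbed, alpha=1.0):
    """Assumed contrast rule: boost the standard logits and subtract the
    logits obtained under the disturbance instruction."""
    return [(1 + alpha) * s - alpha * d for s, d in zip(standard, disturbed)]


# Hypothetical three-token vocabulary and logits.
vocab = ["cat", "dog", "frisbee"]
standard = [1.9, 2.0, 0.5]   # normal instruction: "dog" narrowly wins
disturbed = [1.0, 2.5, 0.4]  # disturbance instruction amplifies "dog"

before = vocab[standard.index(max(standard))]
adjusted = contrastive_logits(standard, disturbed, alpha=1.0)
probs = softmax(adjusted)
after = vocab[probs.index(max(probs))]
print(f"greedy token before contrast: {before}, after contrast: {after}")
```

In this toy case the token whose score grows most under the disturbance instruction is penalized, so the greedy choice moves from "dog" to "cat".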
☆ Bringing Textual Prompt to AI-Generated Image Quality Assessment ICME2024
AI-Generated Images (AGIs) are inherently multimodal. Unlike
traditional image quality assessment (IQA) on natural scenarios, AGI quality
assessment (AGIQA) takes the correspondence between an image and its textual prompt
into consideration. This is coupled in the ground truth score, which confuses
the unimodal IQA methods. To solve this problem, we introduce IP-IQA (AGIs
Quality Assessment via Image and Prompt), a multimodal framework for AGIQA via
corresponding image and prompt incorporation. Specifically, we propose a novel
incremental pretraining task named Image2Prompt for better understanding of
AGIs and their corresponding textual prompts. An effective and efficient
image-prompt fusion module, along with a novel special [QA] token, is also
applied. Both are plug-and-play and beneficial for the cooperation of an image and
its corresponding prompt. Experiments demonstrate that our IP-IQA achieves the
state-of-the-art on AGIQA-1k and AGIQA-3k datasets. Code will be available.
comment: 6 pages, 3 figures, accepted by ICME2024
☆ SAT-NGP : Unleashing Neural Graphics Primitives for Fast Relightable Transient-Free 3D reconstruction from Satellite Imagery
Current stereo-vision pipelines produce high accuracy 3D reconstruction when
using multiple pairs or triplets of satellite images. However, these pipelines
are sensitive to the changes between images that can occur as a result of
multi-date acquisitions. Such variations are mainly due to variable shadows,
reflections and transient objects (cars, vegetation). To take such changes into
account, Neural Radiance Fields (NeRF) have recently been applied to multi-date
satellite imagery. However, neural methods are very compute-intensive, taking
dozens of hours to learn, compared with minutes for standard stereo-vision
pipelines. Following the ideas of Instant Neural Graphics Primitives, we propose
to use an efficient sampling strategy and multi-resolution hash encoding to
accelerate the learning. Our model, Satellite Neural Graphics Primitives
(SAT-NGP), decreases the learning time to 15 minutes while maintaining the
quality of the 3D reconstruction.
comment: 5 pages, 3 figures, 1 table; Accepted to International Geoscience and
Remote Sensing Symposium (IGARSS) 2024; Code available at
https://github.com/Ellimac0/SAT-NGP
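The multi-resolution hash encoding mentioned in the SAT-NGP abstract comes from Instant Neural Graphics Primitives. A minimal sketch of its indexing scheme follows: grid resolutions grow geometrically across levels, and integer voxel coordinates are mapped into a fixed-size table with an XOR-of-primes spatial hash. The parameter values are illustrative and not SAT-NGP's configuration. The sketch hashes at every level and looks up only one voxel corner per level; it omits the learned feature tables and the trilinear interpolation over the eight corners.

```python
import math

# Per-axis hash constants from the Instant NGP spatial hash.
PRIMES = (1, 2654435761, 805459861)


def level_resolutions(n_min, n_max, num_levels):
    """Geometric progression of grid resolutions from n_min to about n_max."""
    growth = math.exp((math.log(n_max) - math.log(n_min)) / (num_levels - 1))
    return [math.floor(n_min * growth ** level) for level in range(num_levels)]


def hash_index(ix, iy, iz, table_size):
    """XOR the coordinate-prime products, then wrap into the table."""
    h = (ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])
    return h % table_size


def voxel_indices(point, resolutions, table_size):
    """Map a point in [0, 1)^3 to one table index per resolution level."""
    indices = []
    for res in resolutions:
        ix, iy, iz = (int(c * res) for c in point)  # lower voxel corner
        indices.append(hash_index(ix, iy, iz, table_size))
    return indices


# Illustrative settings (assumed, not from the paper).
resolutions = level_resolutions(n_min=16, n_max=512, num_levels=6)
table_size = 2 ** 14
indices = voxel_indices((0.25, 0.5, 0.75), resolutions, table_size)
print("resolutions:", resolutions)
print("table indices:", indices)
```

Because every level shares the same small table size, memory stays bounded as resolution grows; hash collisions at fine levels are left for training to resolve.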
☆ Dense Vision Transformer Compression with Few Samples CVPR 2024
Few-shot model compression aims to compress a large model into a more compact
one with only a tiny training set (even without labels). Block-level pruning
has recently emerged as a leading technique in achieving high accuracy and low
latency in few-shot CNN compression. However, few-shot compression for Vision
Transformers (ViT) remains largely unexplored, which presents a new challenge.
In particular, the issue of sparse compression exists in traditional CNN
few-shot methods, which can only produce very few compressed models of
different model sizes. This paper proposes a novel framework for few-shot ViT
compression named DC-ViT. Instead of dropping the entire block, DC-ViT
selectively eliminates the attention module while retaining and reusing
portions of the MLP module. DC-ViT enables dense compression, which outputs
numerous compressed models that densely populate the range of model complexity.
DC-ViT outperforms state-of-the-art few-shot compression methods by a
significant margin of 10 percentage points, along with lower latency in the
compression of ViT and its variants.
comment: Accepted to CVPR 2024. Note: Jianxin Wu is a contributing author for
the arXiv version of this paper but is not listed as an author in the CVPR
version due to his role as Program Chair
☆ Annolid: Annotate, Segment, and Track Anything You Need
Annolid is a deep learning-based software package designed for the
segmentation, labeling, and tracking of research targets within video files,
focusing primarily on animal behavior analysis. Based on state-of-the-art
instance segmentation methods, Annolid now harnesses the Cutie video object
segmentation model to achieve resilient, markerless tracking of multiple
animals from single annotated frames, even in environments in which they may be
partially or entirely concealed by environmental features or by one another.
Our integration of Segment Anything and Grounding-DINO strategies additionally
enables the automatic masking and segmentation of recognizable animals and
objects by text command, removing the need for manual annotation. Annolid's
comprehensive approach to object segmentation flexibly accommodates a broad
spectrum of behavior analysis applications, enabling the classification of
diverse behavioral states such as freezing, digging, pup huddling, and social
interactions in addition to the tracking of animals and their body parts.
☆ Deep Learning for Robust and Explainable Models in Computer Vision
Recent breakthroughs in machine and deep learning (ML and DL) research have
provided excellent tools for leveraging enormous amounts of data and optimizing
huge models with millions of parameters to obtain accurate networks for image
processing. These developments open up tremendous opportunities for using
artificial intelligence (AI) in the automation and human-assisted AI industry.
However, as more and more models are deployed and used in practice, many
challenges have emerged. This thesis presents various approaches that address
robustness and explainability challenges for using ML and DL in practice.
Robustness and reliability are the critical components of any model before
certification and deployment in practice. Deep convolutional neural networks
(CNNs) exhibit vulnerability to transformations of their inputs, such as
rotation and scaling, or intentional manipulations as described in the
adversarial attack literature. In addition, building trust in AI-based models
requires a better understanding of current models and developing methods that
are more explainable and interpretable a priori.
This thesis presents developments in computer vision models' robustness and
explainability. Furthermore, this thesis offers an example of using vision
models' feature response visualization (models' interpretations) to improve
robustness despite interpretability and robustness being seemingly unrelated in
the related research. Besides methodological developments for robust and
explainable vision models, a key message of this thesis is introducing model
interpretation techniques as a tool for understanding vision models and
improving their design and robustness. In addition to the theoretical
developments, this thesis demonstrates several applications of ML and DL in
different contexts, such as medical imaging and affective computing.
comment: 150 pages, 37 figures, 12 tables
☆ InstructBrush: Learning Attention-based Instruction Optimization for Image Editing
Ruoyu Zhao, Qingnan Fan, Fei Kou, Shuai Qin, Hong Gu, Wei Wu, Pengcheng Xu, Mingrui Zhu, Nannan Wang, Xinbo Gao
In recent years, instruction-based image editing methods have garnered
significant attention in image editing. However, despite encompassing a wide
range of editing priors, these methods are helpless when handling editing tasks
that are challenging to accurately describe through language. We propose
InstructBrush, an inversion method for instruction-based image editing methods
to bridge this gap. It extracts editing effects from exemplar image pairs as
editing instructions, which are further applied for image editing. Two key
techniques are introduced into InstructBrush, Attention-based Instruction
Optimization and Transformation-oriented Instruction Initialization, to address
the limitations of the previous method in terms of inversion effects and
instruction generalization. To explore the ability of instruction inversion
methods to guide image editing in open scenarios, we establish a
Transformation-Oriented Paired Benchmark (TOP-Bench), which contains a rich set
of scenes and editing types. The creation of this benchmark paves the way for
further exploration of instruction inversion. Quantitatively and qualitatively,
our approach achieves superior performance in editing and is more semantically
consistent with the target editing effects.
comment: Project Page: https://royzhao926.github.io/InstructBrush/
☆ Addressing Data Annotation Challenges in Multiple Sensors: A Solution for Scania Collected Datasets
Ajinkya Khoche, Aron Asefaw, Alejandro Gonzalez, Bogdan Timus, Sina Sharif Mansouri, Patric Jensfelt
Data annotation in autonomous vehicles is a critical step in the development
of Deep Neural Network (DNN) based models or the performance evaluation of the
perception system. This often takes the form of adding 3D bounding boxes on
time-sequential and registered series of point-sets captured from active
sensors like Light Detection and Ranging (LiDAR) and Radio Detection and
Ranging (RADAR). When annotating multiple active sensors, there is a need to
motion compensate and translate the points to a consistent coordinate frame and
timestamp respectively. However, highly dynamic objects pose a unique
challenge, as they can appear at different timestamps in each sensor's data.
Without knowing the speed of the objects, their position appears to be
different in different sensor outputs. Thus, even after motion compensation,
highly dynamic objects are not matched from multiple sensors in the same frame,
and human annotators struggle to add unique bounding boxes that capture all
objects. This article focuses on addressing this challenge, primarily within
the context of Scania collected datasets. The proposed solution takes a track
of an annotated object as input and uses the Moving Horizon Estimation (MHE) to
robustly estimate its speed. The estimated speed profile is utilized to correct
the position of the annotated box and add boxes to object clusters missed by
the original annotation.
comment: Accepted to European Control Conference 2024
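The position-correction step described above can be sketched as follows, assuming a constant velocity over the short interval between a sensor's capture time and the common reference timestamp. The function and argument names are hypothetical. The velocity would come from the speed estimator (MHE in the paper), which is not reproduced here.

```python
def compensate_position(position, velocity, t_observed, t_reference):
    """Shift an annotated box centre to the reference timestamp.

    Assumes constant velocity between t_observed and t_reference (an
    assumption of this sketch, not a statement about the paper's model).
    """
    dt = t_reference - t_observed
    return tuple(p + v * dt for p, v in zip(position, velocity))


if __name__ == "__main__":
    # An object moving at 20 m/s along x, seen 50 ms before the reference time.
    corrected = compensate_position(
        position=(10.0, 2.0), velocity=(20.0, 0.0),
        t_observed=0.05, t_reference=0.10,
    )
    print(corrected)
```

At 20 m/s, a 50 ms offset between sensors moves the object by about one metre, which is enough to make one object look like two separate clusters.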
☆ Transformers-based architectures for stroke segmentation: A review
Stroke remains a significant global health concern, necessitating precise and
efficient diagnostic tools for timely intervention and improved patient
outcomes. The emergence of deep learning methodologies has transformed the
landscape of medical image analysis. Recently, Transformers, initially designed
for natural language processing, have exhibited remarkable capabilities in
various computer vision applications, including medical image analysis. This
comprehensive review aims to provide an in-depth exploration of the
cutting-edge Transformer-based architectures applied in the context of stroke
segmentation. It commences with an exploration of stroke pathology, imaging
modalities, and the challenges associated with accurate diagnosis and
segmentation. Subsequently, the review delves into the fundamental ideas of
Transformers, offering detailed insights into their architectural intricacies
and the underlying mechanisms that empower them to effectively capture complex
spatial information within medical images. The existing literature is
systematically categorized and analyzed, discussing various approaches that
leverage Transformers for stroke segmentation. A critical assessment is
provided, highlighting the strengths and limitations of these methods,
including considerations of performance and computational efficiency.
Additionally, this review explores potential avenues for future research and
development.
☆ FlexEdit: Flexible and Controllable Diffusion-based Object-centric Image Editing
Our work addresses limitations seen in previous approaches for object-centric
editing problems, such as unrealistic results due to shape discrepancies and
limited control in object replacement or insertion. To this end, we introduce
FlexEdit, a flexible and controllable editing framework for objects where we
iteratively adjust latents at each denoising step using our FlexEdit block.
Initially, we optimize latents at test time to align with specified object
constraints. Then, our framework employs an adaptive mask, automatically
extracted during denoising, to protect the background while seamlessly blending
new content into the target image. We demonstrate the versatility of FlexEdit
in various object editing tasks and curate an evaluation test suite with
samples from both real and synthetic images, along with novel evaluation
metrics designed for object-centric editing. We conduct extensive experiments
on different editing scenarios, demonstrating the superiority of our editing
framework over recent advanced text-guided image editing methods. Our project
page is published at https://flex-edit.github.io/.
comment: Our project page: https://flex-edit.github.io/
☆ RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos
Procedure Planning in instructional videos entails generating a sequence of
action steps based on visual observations of the initial and target states.
Despite the rapid progress in this task, there remain several critical
challenges to be solved: (1) Adaptive procedures: Prior works hold an
unrealistic assumption that the number of action steps is known and fixed,
leading to non-generalizable models in real-world scenarios where the sequence
length varies. (2) Temporal relation: Understanding the step temporal relation
knowledge is essential in producing reasonable and executable plans. (3)
Annotation cost: Annotating instructional videos with step-level labels (i.e.,
timestamp) or sequence-level labels (i.e., action category) is demanding and
labor-intensive, limiting its generalizability to large-scale datasets. In this
work, we propose a new and practical setting, called adaptive procedure
planning in instructional videos, where the procedure length is not fixed or
pre-determined. To address these challenges, we introduce the Retrieval-Augmented
Planner (RAP) model. Specifically, for adaptive procedures, RAP adaptively
determines the conclusion of actions using an auto-regressive model
architecture. For temporal relation, RAP establishes an external memory module
to explicitly retrieve the most relevant state-action pairs from the training
videos and revises the generated procedures. To tackle high annotation cost,
RAP uses a weakly supervised learning scheme to expand the training dataset
to other task-relevant, unannotated videos by generating pseudo labels for
action steps. Experiments on CrossTask and COIN benchmarks show the superiority
of RAP over traditional fixed-length models, establishing it as a strong
baseline solution for adaptive procedure planning.
comment: 23 pages, 6 figures, 12 tables
☆ Homogeneous Tokenizer Matters: Homogeneous Visual Tokenizer for Remote Sensing Image Understanding
The tokenizer, as one of the fundamental components of large models, has long
been overlooked or even misunderstood in visual tasks. One key factor of the
great comprehension power of the large language model is that natural language
tokenizers utilize meaningful words or subwords as the basic elements of
language. In contrast, mainstream visual tokenizers, represented by patch-based
methods such as Patch Embed, rely on meaningless rectangular patches as basic
elements of vision, which cannot serve as effectively as words or subwords in
language. Starting from the essence of the tokenizer, we defined semantically
independent regions (SIRs) for vision. We designed a simple HOmogeneous visual
tOKenizer: HOOK. HOOK mainly consists of two modules: the Object Perception
Module (OPM) and the Object Vectorization Module (OVM). To achieve homogeneity,
the OPM splits the image into 4*4 pixel seeds and then utilizes the attention
mechanism to perceive SIRs. The OVM employs cross-attention to merge seeds
within the same SIR. To achieve adaptability, the OVM defines a variable number
of learnable vectors as cross-attention queries, allowing for the adjustment of
token quantity. We conducted experiments on the NWPU-RESISC45, WHU-RS19
classification dataset, and GID5 segmentation dataset for sparse and dense
tasks. The results demonstrate that the visual tokens obtained by HOOK
correspond to individual objects, which demonstrates homogeneity. HOOK
outperformed Patch Embed by 6% and 10% in the two tasks and achieved
state-of-the-art performance compared to the baselines used for comparison.
Compared to Patch Embed, which requires more than one hundred tokens for one
image, HOOK requires only 6 and 8 tokens for sparse and dense tasks,
respectively, resulting in efficiency improvements of 1.5 to 2.8 times. The
code is available at https://github.com/GeoX-Lab/Hook.
comment: 20 pages, 8 figures, 6 tables
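The idea of merging many pixel seeds into a small, adjustable number of tokens with cross-attention can be sketched in plain Python. This is a simplified illustration, not the HOOK implementation. It omits the learned key, value, and output projections, and in a real model the query vectors would be learned parameters rather than the fixed lists used here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_merge(seeds, queries):
    """Merge seed features into len(queries) tokens.

    seeds:   list of n_seeds feature vectors of dimension d
    queries: list of n_tokens query vectors of dimension d
    Returns one weighted average of the seeds per query.
    """
    d = len(seeds[0])
    tokens = []
    for q in queries:
        scores = [sum(qi * si for qi, si in zip(q, s)) / math.sqrt(d)
                  for s in seeds]
        weights = softmax(scores)
        tokens.append([sum(w * s[k] for w, s in zip(weights, seeds))
                       for k in range(d)])
    return tokens


if __name__ == "__main__":
    seeds = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    queries = [[0.0, 0.0], [4.0, 0.0]]  # changing this list changes the token count
    for token in cross_attention_merge(seeds, queries):
        print(token)
```

The number of output tokens equals the number of queries, so the token count can be adjusted without changing the seed grid.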
☆ Users prefer Jpegli over same-sized libjpeg-turbo or MozJPEG
We performed pairwise comparisons by human raters of JPEG images from
MozJPEG, libjpeg-turbo and our new Jpegli encoder. When compressing images at a
quality similar to libjpeg-turbo quality 95, the Jpegli images were 54% likely
to be preferred over both libjpeg-turbo and MozJPEG images, but used only 2.8
bits per pixel compared to libjpeg-turbo and MozJPEG that used 3.8 and 3.5 bits
per pixel respectively. The raw ratings and source images are publicly
available for further analysis and study.
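The size difference implied by the reported bits-per-pixel figures can be worked out directly. The short Python sketch below uses only the numbers quoted in the abstract. The helper name is hypothetical.

```python
def relative_saving(bpp_new, bpp_old):
    """Fraction of bits saved when going from bpp_old to bpp_new."""
    return 1.0 - bpp_new / bpp_old


if __name__ == "__main__":
    jpegli, libjpeg_turbo, mozjpeg = 2.8, 3.8, 3.5  # bits per pixel, from the abstract
    print(round(100 * relative_saving(jpegli, libjpeg_turbo), 1))  # 26.3
    print(round(100 * relative_saving(jpegli, mozjpeg), 1))        # 20.0
```

At these settings, Jpegli files are roughly 26% smaller than libjpeg-turbo files and 20% smaller than MozJPEG files.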
☆ The Impact of Uniform Inputs on Activation Sparsity and Energy-Latency Attacks in Computer Vision SP 2024
Resource efficiency plays an important role for machine learning nowadays.
The energy and decision latency are two critical aspects to ensure a
sustainable and practical application. Unfortunately, the energy consumption
and decision latency are not robust against adversaries. Researchers have
recently demonstrated that attackers can compute and submit so-called sponge
examples at inference time to increase the energy consumption and decision
latency of neural networks. In computer vision, the proposed strategy crafts
inputs with less activation sparsity which could otherwise be used to
accelerate the computation. In this paper, we analyze the mechanism by which these
energy-latency attacks reduce activation sparsity. In particular, we find that
input uniformity is a key enabler. A uniform image, that is, an image with
mostly flat, uniformly colored surfaces, triggers more activations due to a
specific interplay of convolution, batch normalization, and ReLU activation.
Based on these insights, we propose two new simple, yet effective strategies
for crafting sponge examples: sampling images from a probability distribution
and identifying dense, yet inconspicuous inputs in natural datasets. We
empirically examine our findings in a comprehensive evaluation with multiple
image classification models and show that our attack achieves the same sparsity
effect as prior sponge-example methods, but at a fraction of the computational
effort. We also show that our sponge examples transfer between different neural
networks. Finally, we discuss beneficial applications of our findings, namely
improving efficiency by increasing sparsity.
comment: Accepted at the DLSP 2024
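Activation sparsity, the quantity these attacks reduce, can be illustrated with a small Python sketch. It assumes the common definition of sparsity as the fraction of exactly-zero values after ReLU. The paper's own measurement procedure may differ, and the input values below are made up for illustration.

```python
def relu(values):
    """Element-wise ReLU on a flat list of pre-activations."""
    return [max(0.0, v) for v in values]


def activation_sparsity(activations):
    """Fraction of activations that are exactly zero."""
    if not activations:
        raise ValueError("activations must be non-empty")
    zeros = sum(1 for a in activations if a == 0.0)
    return zeros / len(activations)


if __name__ == "__main__":
    pre_activations = [-1.0, 0.5, -2.0, 3.0]  # illustrative values only
    print(activation_sparsity(relu(pre_activations)))  # 0.5
```

Hardware or software that skips zero activations does less work when this fraction is high, which is why inputs that drive it down raise energy use and latency.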
☆ HandBooster: Boosting 3D Hand-Mesh Reconstruction by Conditional Synthesis and Sampling of Hand-Object Interactions
Reconstructing 3D hand mesh robustly from a single image is very challenging,
due to the lack of diversity in existing real-world datasets. While data
synthesis helps relieve the issue, the syn-to-real gap still hinders its usage.
In this work, we present HandBooster, a new approach to uplift the data
diversity and boost the 3D hand-mesh reconstruction performance by training a
conditional generative space on hand-object interactions and purposely sampling
the space to synthesize effective data samples. First, we construct versatile
content-aware conditions to guide a diffusion model to produce realistic images
with diverse hand appearances, poses, views, and backgrounds; favorably,
accurate 3D annotations are obtained for free. Then, we design a novel
condition creator based on our similarity-aware distribution sampling
strategies to deliberately find novel and realistic interaction poses that are
distinctive from the training set. Equipped with our method, several baselines
can be significantly improved beyond the SOTA on the HO3D and DexYCB
benchmarks. Our code will be released on
https://github.com/hxwork/HandBooster_Pytorch.
☆ Artifact Reduction in 3D and 4D Cone-beam Computed Tomography Images with Deep Learning -- A Review
Deep learning based approaches have been used to improve image quality in
cone-beam computed tomography (CBCT), a medical imaging technique often used in
applications such as image-guided radiation therapy, implant dentistry or
orthopaedics. In particular, while deep learning methods have been applied to
reduce various types of CBCT image artifacts arising from motion, metal
objects, or low-dose acquisition, a comprehensive review summarizing the
successes and shortcomings of these approaches, with a primary focus on the
type of artifacts rather than the architecture of neural networks, is lacking
in the literature. In this review, the data generation and simulation
pipelines, and artifact reduction techniques are specifically investigated for
each type of artifact. We provide an overview of deep learning techniques that
have successfully been shown to reduce artifacts in 3D, as well as in
time-resolved (4D) CBCT through the use of projection- and/or volume-domain
optimizations, or by introducing neural networks directly within the CBCT
reconstruction algorithms. Research gaps are identified to suggest avenues for
future exploration. One of the key findings of this work is an observed trend
towards the use of generative models including GANs and score-based or
diffusion models, accompanied by the need for more diverse and open training
datasets and simulations.
comment: 16 pages, 4 figures, 1 Table, published in IEEE Access Journal
☆ CosalPure: Learning Concept from Group Images for Robust Co-Saliency Detection
Co-salient object detection (CoSOD) aims to identify the common and salient
(usually in the foreground) regions across a given group of images. Although
achieving significant progress, state-of-the-art CoSODs could be easily
affected by some adversarial perturbations, leading to substantial accuracy
reduction. The adversarial perturbations can mislead CoSODs but do not change
the high-level semantic information (e.g., concept) of the co-salient objects.
In this paper, we propose a novel robustness enhancement framework by first
learning the concept of the co-salient objects based on the input group images
and then leveraging this concept to purify adversarial perturbations, which are
subsequently fed to CoSODs for robustness enhancement. Specifically, we propose
CosalPure containing two modules, i.e., group-image concept learning and
concept-guided diffusion purification. For the first module, we adopt a
pre-trained text-to-image diffusion model to learn the concept of co-salient
objects within group images where the learned concept is robust to adversarial
examples. For the second module, we map the adversarial image to the latent
space and then perform diffusion generation by embedding the learned concept
into the noise prediction function as an extra condition. Our method can
effectively alleviate the influence of the SOTA adversarial attack containing
different adversarial patterns, including exposure and noise. The extensive
results demonstrate that our method could enhance the robustness of CoSODs
significantly.
comment: 8 pages
☆ Attention Calibration for Disentangled Text-to-Image Personalization CVPR 2024
Recent thrilling progress in large-scale text-to-image (T2I) models has
unlocked unprecedented synthesis quality of AI-generated content (AIGC)
including image generation, 3D and video composition. Further, personalized
techniques enable appealing customized production of a novel concept given only
several images as reference. However, an intriguing problem persists: Is it
possible to capture multiple, novel concepts from one single reference image?
In this paper, we identify that existing approaches fail to preserve visual
consistency with the reference image and eliminate cross-influence from
concepts. To alleviate this, we propose an attention calibration mechanism to
improve the concept-level understanding of the T2I model. Specifically, we
first introduce new learnable modifiers bound with classes to capture
attributes of multiple concepts. Then, the classes are separated and
strengthened following the activation of the cross-attention operation,
ensuring comprehensive and self-contained concepts. Additionally, we suppress
the attention activation of different classes to mitigate mutual influence
among concepts. Together, our proposed method, dubbed DisenDiff, can learn
disentangled multiple concepts from one single image and produce novel
customized images with learned concepts. We demonstrate that our method
outperforms the current state of the art in both qualitative and quantitative
evaluations. More importantly, our proposed techniques are compatible with LoRA
and inpainting pipelines, enabling more interactive experiences.
comment: Accepted to CVPR 2024
☆ OrCo: Towards Better Generalization via Orthogonality and Contrast for Few-Shot Class-Incremental Learning
Few-Shot Class-Incremental Learning (FSCIL) introduces a paradigm in which
the problem space expands with limited data. FSCIL methods inherently face the
challenge of catastrophic forgetting as data arrives incrementally, making
models susceptible to overwriting previously acquired knowledge. Moreover,
given the scarcity of labeled samples available at any given time, models may
be prone to overfitting and find it challenging to strike a balance between
extensive pretraining and the limited incremental data. To address these
challenges, we propose the OrCo framework built on two core principles:
features' orthogonality in the representation space, and contrastive learning.
In particular, we improve the generalization of the embedding space by
employing a combination of supervised and self-supervised contrastive losses
during the pretraining phase. Additionally, we introduce OrCo loss to address
challenges arising from data limitations during incremental sessions. Through
feature space perturbations and orthogonality between classes, the OrCo loss
maximizes margins and reserves space for the following incremental data. This,
in turn, ensures the accommodation of incoming classes in the feature space
without compromising previously acquired knowledge. Our experimental results
showcase state-of-the-art performance across three benchmark datasets,
including mini-ImageNet, CIFAR100, and CUB datasets. Code is available at
https://github.com/noorahmedds/OrCo
☆ A Semi-supervised Nighttime Dehazing Baseline with Spatial-Frequency Aware and Realistic Brightness Constraint CVPR2024
Existing research based on deep learning has extensively explored the problem
of daytime image dehazing. However, few studies have considered the
characteristics of nighttime hazy scenes. There are two distinctions between
nighttime and daytime haze. First, there may be multiple active colored light
sources with lower illumination intensity in nighttime scenes, which may cause
haze, glow and noise with localized, coupled and frequency inconsistent
characteristics. Second, due to the domain discrepancy between simulated and
real-world data, unrealistic brightness may occur when applying a dehazing
model trained on simulated data to real-world data. To address the above two
issues, we propose a semi-supervised model for real-world nighttime dehazing.
First, the spatial attention and frequency spectrum filtering are implemented
as a spatial-frequency domain information interaction module to handle the
first issue. Second, a pseudo-label-based retraining strategy and a local
window-based brightness loss for the semi-supervised training process are designed
to suppress haze and glow while achieving realistic brightness. Experiments on
public benchmarks validate the effectiveness of the proposed method and its
superiority over state-of-the-art methods. The source code and supplementary
materials are available at https://github.com/Xiaofeng-life/SFSNiD.
comment: This paper is accepted by CVPR2024
☆ Efficient Heatmap-Guided 6-Dof Grasp Detection in Cluttered Scenes
Fast and robust object grasping in clutter is a crucial component of
robotics. Most current works resort to the whole observed point cloud for 6-Dof
grasp generation, ignoring the guidance information excavated from global
semantics, thus limiting high-quality grasp generation and real-time
performance. In this work, we show that the widely used heatmaps are
underestimated in the efficiency of 6-Dof grasp generation. Therefore, we
propose an effective local grasp generator combined with grasp heatmaps as
guidance, which infers in a global-to-local semantic-to-point way.
Specifically, Gaussian encoding and the grid-based strategy are applied to
predict grasp heatmaps as guidance to aggregate local points into graspable
regions and provide global semantic information. Further, a novel non-uniform
anchor sampling mechanism is designed to improve grasp accuracy and diversity.
Benefiting from the high-efficiency encoding in the image space and focusing on
points in local graspable regions, our framework can perform high-quality grasp
detection in real-time and achieve state-of-the-art results. In addition, real
robot experiments demonstrate the effectiveness of our method with a success
rate of 94% and a clutter completion rate of 100%. Our code is available at
https://github.com/THU-VCLab/HGGD.
comment: Extensive results on GraspNet-1B dataset
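Gaussian encoding of a heatmap target, as mentioned in the abstract, is commonly done by placing a 2D Gaussian at each annotated location. The Python sketch below shows that generic construction for a single centre. The grid size, centre, and sigma are hypothetical, and the code is not taken from the HGGD repository.

```python
import math


def gaussian_heatmap(height, width, center, sigma):
    """Return a height x width grid peaking at 1.0 on `center` (row, col)."""
    cy, cx = center
    return [
        [math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


if __name__ == "__main__":
    heatmap = gaussian_heatmap(height=5, width=5, center=(2, 2), sigma=1.0)
    for row in heatmap:
        print(" ".join(f"{v:.2f}" for v in row))
```

A soft target of this kind gives the network a graded signal around each grasp location rather than a single positive pixel.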
☆ Language Plays a Pivotal Role in the Object-Attribute Compositional Generalization of CLIP
Vision-language models, such as CLIP, have shown promising
Out-of-Distribution (OoD) generalization under various types of distribution
shifts. Recent studies attempted to investigate the leading cause of this
capability. In this work, we follow the same path, but focus on a specific type
of OoD data - images with novel compositions of attribute-object pairs - and
study whether such models can successfully classify those images into
composition classes. We carefully designed an authentic image test dataset
called ImageNet-AO, consisting of attributes for objects that are unlikely to be
encountered in the CLIP training sets. We found that CLIPs trained with large
datasets such as OpenAI CLIP, LAION-400M, and LAION-2B show orders-of-magnitude
improvement in effective compositional OoD generalization compared to both
supervised models and CLIPs trained with smaller datasets, such as CC-12M and
YFCC-15M. Our results provide evidence that the scale and diversity of training
data and language supervision play a key role in unlocking the compositional
generalization abilities of vision-language models.
comment: Oral accepted at OODCV 2023 (http://www.ood-cv.org)
☆ CT-3DFlow: Leveraging 3D Normalizing Flows for Unsupervised Detection of Pathological Pulmonary CT scans
Aissam Djahnine, Alexandre Popoff, Emilien Jupin-Delevaux, Vincent Cottin, Olivier Nempont, Loic Boussel
Unsupervised pathology detection can be implemented by training a model on
healthy data only and measuring the deviation from the training set upon
inference, for example with CNN-based feature extraction and one-class
classifiers, or reconstruction-score-based methods such as AEs, GANs and
Diffusion models. Normalizing Flows (NF) have the ability to directly learn the
probability distribution of training examples through an invertible
architecture. We leverage this property in a novel 3D NF-based model named
CT-3DFlow, specifically tailored for patient-level pulmonary pathology
detection in chest CT data. Our model is trained unsupervised on healthy 3D
pulmonary CT patches, and detects deviations from its log-likelihood
distribution as anomalies. We aggregate patch-level likelihood values from a
patient's CT scan to provide a patient-level 'normal'/'abnormal' prediction.
Out-of-distribution detection performance is evaluated using expert annotations
on a separate chest CT test dataset, outperforming other state-of-the-art
methods.
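The aggregation from patch-level likelihoods to a patient-level decision can be sketched as below. The abstract does not say how values are aggregated or how the decision boundary is set. The mean aggregation and fixed threshold are assumptions of this sketch, and all numbers are made up.

```python
def patient_prediction(patch_log_likelihoods, threshold):
    """Label a scan from the log-likelihoods of its patches.

    Assumption: the patient score is the mean patch log-likelihood, and
    scores below `threshold` are flagged as abnormal.
    """
    if not patch_log_likelihoods:
        raise ValueError("at least one patch is required")
    score = sum(patch_log_likelihoods) / len(patch_log_likelihoods)
    return "abnormal" if score < threshold else "normal"


if __name__ == "__main__":
    print(patient_prediction([-1.0, -1.2, -0.8], threshold=-2.0))  # normal
    print(patient_prediction([-1.0, -9.0, -8.0], threshold=-2.0))  # abnormal
```

A model trained only on healthy patches assigns low likelihood to unfamiliar tissue patterns, so a low aggregate score suggests the scan deviates from the healthy distribution.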
☆ ParCo: Part-Coordinating Text-to-Motion Synthesis
We study a challenging task: text-to-motion synthesis, aiming to generate
motions that align with textual descriptions and exhibit coordinated movements.
Currently, the part-based methods introduce part partition into the motion
synthesis process to achieve finer-grained generation. However, these methods
encounter challenges such as the lack of coordination between different part
motions and difficulties for networks to understand part concepts. Moreover,
introducing finer-grained part concepts poses computational complexity
challenges. In this paper, we propose Part-Coordinating Text-to-Motion
Synthesis (ParCo), endowed with enhanced capabilities for understanding part
motions and communication among different part motion generators, ensuring a
coordinated and fine-grained motion synthesis. Specifically, we discretize
whole-body motion into multiple part motions to establish the prior concept of
different parts. Afterward, we employ multiple lightweight generators designed
to synthesize different part motions and coordinate them through our part
coordination module. Our approach demonstrates superior performance on common
benchmarks, including HumanML3D and KIT-ML, with economical computation,
providing substantial evidence of its effectiveness. Code is available at
https://github.com/qrzou/ParCo .
☆ HEMIT: H&E to Multiplex-immunohistochemistry Image Translation with Dual-Branch Pix2pix Generator
Computational analysis of multiplexed immunofluorescence histology data is
emerging as an important method for understanding the tumour micro-environment
in cancer. This work presents HEMIT, a dataset designed for translating
Hematoxylin and Eosin (H&E) sections to multiplex-immunohistochemistry (mIHC)
images, featuring DAPI, CD3, and panCK markers. Distinctively, HEMIT's mIHC
images are multi-component and cellular-level aligned with H&E, enriching
supervised stain translation tasks. To our knowledge, HEMIT is the first
publicly available cellular-level aligned dataset that enables H&E to
multi-target mIHC image translation. This dataset provides the computer vision
community with a valuable resource to develop novel computational methods which
have the potential to gain new insights from H&E slide archives.
We also propose a new dual-branch generator architecture, using residual
Convolutional Neural Networks (CNNs) and Swin Transformers which achieves
better translation outcomes than other popular algorithms. When evaluated on
HEMIT, it outperforms pix2pixHD, pix2pix, U-Net, and ResNet, achieving the
highest overall score on key metrics including the Structural Similarity Index
Measure (SSIM), Pearson correlation score (R), and Peak signal-to-noise Ratio
(PSNR). Additionally, downstream analysis has been used to further validate the
quality of the generated mIHC images. These results set a new benchmark in the
field of stain translation tasks.
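Two of the metrics named above, PSNR and the Pearson correlation, have standard definitions that can be sketched in plain Python on flattened pixel lists. The pixel values below are made up for illustration, and this is not the evaluation code used for HEMIT. SSIM is omitted because it requires windowed statistics.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    real = [10.0, 50.0, 200.0, 120.0]  # illustrative pixel intensities
    fake = [12.0, 48.0, 190.0, 125.0]
    print(round(psnr(real, fake), 2), round(pearson_r(real, fake), 4))
```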
☆ Direct mineral content prediction from drill core images via transfer learning
Romana Boiger, Sergey V. Churakov, Ignacio Ballester Llagaria, Georg Kosakowski, Raphael Wüst, Nikolaos I. Prasianakis
Deep subsurface exploration is important for mining, oil and gas industries,
as well as in the assessment of geological units for the disposal of chemical
or nuclear waste, or the viability of geothermal energy systems. Typically,
detailed examinations of subsurface formations or units are performed on
cuttings or core materials extracted during drilling campaigns, as well as on
geophysical borehole data, which provide detailed information about the
petrophysical properties of the rocks. Depending on the volume of rock samples
and the analytical program, the laboratory analysis and diagnostics can be very
time-consuming. This study investigates the potential of utilizing machine
learning, specifically convolutional neural networks (CNN), to assess the
lithology and mineral content solely from analysis of drill core images, aiming
to support and expedite the subsurface geological exploration. The paper
outlines a comprehensive methodology, encompassing data preprocessing, machine
learning methods, and transfer learning techniques. The outcome reveals a
remarkable 96.7% accuracy in the classification of drill core segments into
distinct formation classes. Furthermore, a CNN model was trained for the
evaluation of mineral content using a learning data set from multidimensional
log analysis data (silicate, total clay, carbonate). When benchmarked against
laboratory XRD measurements on samples from the cores, both the advanced
multidimensional log analysis model and the neural network approach developed
here provide equally good performance. This work demonstrates that deep
learning and particularly transfer learning can support extracting
petrophysical properties, including mineral content and formation
classification, from drill core images, thus offering a road map for enhancing
model performance and data set quality in image-based analysis of drill cores.
☆ VersaT2I: Improving Text-to-Image Models with Versatile Reward
Jianshu Guo, Wenhao Chai, Jie Deng, Hsiang-Wei Huang, Tian Ye, Yichen Xu, Jiawei Zhang, Jenq-Neng Hwang, Gaoang Wang
Recent text-to-image (T2I) models have benefited from large-scale and
high-quality data, demonstrating impressive performance. However, these T2I
models still struggle to produce images that are aesthetically pleasing,
geometrically accurate, faithful to text, and of good low-level quality. We
present VersaT2I, a versatile training framework that can boost the performance
of any T2I model with multiple rewards. We decompose the quality of the image
into several aspects such as aesthetics, text-image alignment, geometry,
low-level quality, etc. Then, for every quality aspect, we select high-quality
images in this aspect generated by the model as the training set to finetune
the T2I model using the Low-Rank Adaptation (LoRA). Furthermore, we introduce a
gating function to combine multiple quality aspects, which can avoid conflicts
between different quality aspects. Our method is easy to extend and does not
require any manual annotation, reinforcement learning, or model architecture
changes. Extensive experiments demonstrate that VersaT2I outperforms the
baseline methods across various quality criteria.
☆ I2CKD : Intra- and Inter-Class Knowledge Distillation for Semantic Segmentation
This paper proposes a new knowledge distillation method tailored for image
semantic segmentation, termed Intra- and Inter-Class Knowledge Distillation
(I2CKD). The focus of this method is on capturing and transferring knowledge
between the intermediate layers of teacher (cumbersome model) and student
(compact model). For knowledge extraction, we exploit class prototypes derived
from feature maps. To facilitate knowledge transfer, we employ a triplet loss
in order to minimize intra-class variances and maximize inter-class variances
between teacher and student prototypes. Consequently, I2CKD enables the student
to better mimic the feature representation of the teacher for each class,
thereby enhancing the segmentation performance of the compact network.
Extensive experiments on three segmentation datasets, i.e., Cityscapes, Pascal
VOC and CamVid, using various teacher-student network pairs demonstrate the
effectiveness of the proposed method.
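As a rough illustration of the two ingredients mentioned above, the sketch below computes class prototypes by averaging per-pixel feature vectors within each class and evaluates a standard margin-based triplet loss. The function names, the Euclidean distance, and the margin value are assumptions made for illustration; the abstract does not specify how I2CKD forms its triplets.

```python
import math


def class_prototypes(features, labels, num_classes):
    """Average the feature vectors of all pixels that share a class label.

    features: list of per-pixel feature vectors (lists of floats)
    labels:   list of per-pixel class ids, same length as features
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for vec, cls in zip(features, labels):
        counts[cls] += 1
        for d, value in enumerate(vec):
            sums[cls][d] += value
    # Classes that do not occur in the input get no prototype (None).
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else None
        for c in range(num_classes)
    ]


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss: pull anchor toward positive, push it from negative."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

One plausible use, again an assumption, is to take a student prototype as the anchor, the teacher prototype of the same class as the positive, and a teacher prototype of a different class as the negative.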
☆ Modeling uncertainty for Gaussian Splatting
We present Stochastic Gaussian Splatting (SGS): the first framework for
uncertainty estimation using Gaussian Splatting (GS). GS recently advanced the
novel-view synthesis field by achieving impressive reconstruction quality at a
fraction of the computational cost of Neural Radiance Fields (NeRF). However,
contrary to the latter, it still lacks the ability to provide information about
the confidence associated with its outputs. To address this limitation, in
this paper, we introduce a Variational Inference-based approach that seamlessly
integrates uncertainty prediction into the common rendering pipeline of GS.
Additionally, we introduce the Area Under Sparsification Error (AUSE) as a new
term in the loss function, enabling optimization of uncertainty estimation
alongside image reconstruction. Experimental results on the LLFF dataset
demonstrate that our method outperforms existing approaches in terms of both
image rendering quality and uncertainty estimation accuracy. Overall, our
framework equips practitioners with valuable insights into the reliability of
synthesized views, facilitating safer decision-making in real-world
applications.
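For readers unfamiliar with AUSE, the sketch below shows how it is commonly computed as an evaluation metric: pixels are removed in order of decreasing predicted uncertainty, the mean error of the remaining pixels is tracked, and the result is compared with an oracle that removes pixels in order of decreasing true error. The function names and the step count are illustrative assumptions, and this non-differentiable version is not the loss term used in the paper.

```python
def sparsification_curve(errors, ranking_scores, steps=10):
    """Mean error of the pixels kept after removing the top-ranked fraction.

    At step s, the (s / steps) fraction of pixels with the highest
    ranking score is discarded before averaging the remaining errors.
    """
    n = len(errors)
    order = sorted(range(n), key=lambda i: ranking_scores[i], reverse=True)
    curve = []
    for s in range(steps):
        removed = (n * s) // steps
        kept = order[removed:]
        curve.append(sum(errors[i] for i in kept) / len(kept))
    return curve


def ause(errors, uncertainties, steps=10):
    """Area between the uncertainty-ranked curve and the oracle curve."""
    by_uncertainty = sparsification_curve(errors, uncertainties, steps)
    oracle = sparsification_curve(errors, errors, steps)
    gaps = [u - o for u, o in zip(by_uncertainty, oracle)]
    return sum(gaps) / steps  # rectangle rule over the removed fraction
```

An AUSE of zero means the predicted uncertainty ranks pixels exactly as the true error does; larger values indicate a poorer ranking.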
☆ DiffusionFace: Towards a Comprehensive Dataset for Diffusion-Based Face Forgery Analysis
The rapid progress in deep learning has given rise to hyper-realistic facial
forgery methods, leading to concerns related to misinformation and security
risks. Existing face forgery datasets have limitations in generating
high-quality facial images and addressing the challenges posed by evolving
generative techniques. To combat this, we present DiffusionFace, the first
diffusion-based face forgery dataset, covering various forgery categories,
including unconditional and Text Guide facial image generation, Img2Img,
Inpaint, and Diffusion-based facial exchange algorithms. Our DiffusionFace
dataset stands out with its extensive collection of 11 diffusion models and the
high quality of the generated images, providing essential metadata and a
real-world internet-sourced forgery facial image dataset for evaluation.
Additionally, we provide an in-depth analysis of the data and introduce
practical evaluation protocols to rigorously assess discriminative models'
effectiveness in detecting counterfeit facial images, aiming to enhance
security in facial image authentication processes. The dataset is available for
download at https://github.com/Rapisurazurite/DiffFace.
☆ Density-guided Translator Boosts Synthetic-to-Real Unsupervised Domain Adaptive Segmentation of 3D Point Clouds CVPR2024
3D synthetic-to-real unsupervised domain adaptive segmentation is crucial to
annotating new domains. Self-training is a competitive approach for this task,
but its performance is limited by different sensor sampling patterns (i.e.,
variations in point density) and incomplete training strategies. In this work,
we propose a density-guided translator (DGT), which translates point density
between domains, and integrates it into a two-stage self-training pipeline
named DGT-ST. First, in contrast to existing works that simultaneously conduct
data generation and feature/output alignment within unstable adversarial
training, we employ the non-learnable DGT to bridge the domain gap at the input
level. Second, to provide a well-initialized model for self-training, we
propose a category-level adversarial network in stage one that utilizes the
prototype to prevent negative transfer. Finally, by leveraging the designs
above, a domain-mixed self-training method with source-aware consistency loss
is proposed in stage two to narrow the domain gap further. Experiments on two
synthetic-to-real segmentation tasks (SynLiDAR → semanticKITTI and
SynLiDAR → semanticPOSS) demonstrate that DGT-ST outperforms
state-of-the-art methods, achieving 9.4% and 4.3% mIoU improvements,
respectively. Code is available at https://github.com/yuan-zm/DGT-ST.
comment: CVPR2024
☆ Deep Learning Segmentation and Classification of Red Blood Cells Using a Large Multi-Scanner Dataset
Digital pathology has recently been revolutionized by advancements in
artificial intelligence, deep learning, and high-performance computing. With
its advanced tools, digital pathology can help improve and speed up the
diagnostic process, reduce human errors, and streamline the reporting step. In
this paper, we report a new large red blood cell (RBC) image dataset and
propose a two-stage deep learning framework for RBC image segmentation and
classification. The dataset is a highly diverse dataset of more than 100K RBCs
containing eight different classes. The dataset, which is considerably larger
than any publicly available hematopathology dataset, was labeled independently
by two hematopathologists who also manually created masks for RBC cell
segmentation. Subsequently, in the proposed framework, first, a U-Net model was
trained to achieve automatic RBC image segmentation. Second, an EfficientNetB0
model was trained to classify RBC images into one of the eight classes using a
transfer learning approach with a 5x2 cross-validation scheme. An IoU of 98.03%
and an average classification accuracy of 96.5% were attained on the test set.
Moreover, we have performed experimental comparisons against several prominent
CNN models. These comparisons show the superiority of the proposed model with a
good balance between performance and computational cost.
comment: 15 pages, 12 figures, 8 tables
☆ DiffStyler: Diffusion-based Localized Image Style Transfer
Image style transfer aims to imbue digital imagery with the distinctive
attributes of style targets, such as colors, brushstrokes, shapes, whilst
concurrently preserving the semantic integrity of the content. Despite the
advancements in arbitrary style transfer methods, a prevalent challenge remains
the delicate equilibrium between content semantics and style attributes. Recent
developments in large-scale text-to-image diffusion models have heralded
unprecedented synthesis capabilities, albeit at the expense of relying on
extensive and often imprecise textual descriptions to delineate artistic
styles. Addressing these limitations, this paper introduces DiffStyler, a novel
approach that facilitates efficient and precise arbitrary image style transfer.
At its core, DiffStyler uses a text-to-image Stable Diffusion model-based
LoRA to encapsulate the essence of style targets. This approach, coupled with
strategic cross-LoRA feature and attention injection, guides the style transfer
process. The foundation of our methodology is rooted in the observation that
LoRA maintains the spatial feature consistency of UNet, a discovery that
further inspired the development of a mask-wise style transfer technique. This
technique employs masks extracted through a pre-trained FastSAM model,
utilizing mask prompts to facilitate feature fusion during the denoising
process, thereby enabling localized style transfer that preserves the original
image's unaffected regions. Moreover, our approach accommodates multiple style
targets through the use of corresponding masks. Through extensive
experimentation, we demonstrate that DiffStyler surpasses previous methods in
achieving a more harmonious balance between content preservation and style
integration.
☆ Scaling Vision-and-Language Navigation With Offline RL
The study of vision-and-language navigation (VLN) has typically relied on
expert trajectories, which may not always be available in real-world situations
due to the significant effort required to collect them. On the other hand,
existing approaches to training VLN agents that go beyond available expert data
involve data augmentations or online exploration which can be tedious and
risky. In contrast, it is easy to access large repositories of suboptimal
offline trajectories. Inspired by research in offline reinforcement learning
(ORL), we introduce a new problem setup of VLN-ORL which studies VLN using
suboptimal demonstration data. We introduce a simple and effective
reward-conditioned approach that can account for dataset suboptimality for
training VLN agents, as well as benchmarks to evaluate progress and promote
research in this area. We empirically study various noise models for
characterizing dataset suboptimality among other unique challenges in VLN-ORL
and instantiate it for the VLN↻BERT and MTVM architectures in
the R2R and RxR environments. Our experiments demonstrate that the proposed
reward-conditioned approach leads to significant performance improvements, even
in complex and intricate environments.
comment: Published in Transactions on Machine Learning Research (04/2024)
☆ SingularTrajectory: Universal Trajectory Predictor Using Diffusion Model CVPR 2024
There are five types of trajectory prediction tasks: deterministic,
stochastic, domain adaptation, momentary observation, and few-shot. These
associated tasks are defined by various factors, such as the length of input
paths, data split and pre-processing methods. Interestingly, even though they
commonly take sequential coordinates of observations as input and infer future
paths in the same coordinates as output, designing specialized architectures
for each task is still necessary. When a model is applied to the other tasks,
generality issues can lead to sub-optimal performance. In this paper, we propose SingularTrajectory,
a diffusion-based universal trajectory prediction framework to reduce the
performance gap across the five tasks. The core of SingularTrajectory is to
unify a variety of human dynamics representations on the associated tasks. To
do this, we first build a Singular space to project all types of motion
patterns from each task into one embedding space. We next propose an adaptive
anchor working in the Singular space. Unlike traditional fixed anchor methods
that sometimes yield unacceptable paths, our adaptive anchor corrects anchors
that are placed in a wrong location, based on a traversability map.
Finally, we adopt a diffusion-based predictor to further enhance the prototype
paths using a cascaded denoising process. Our unified framework ensures the
generality across various benchmark settings such as input modality and
trajectory lengths. Extensive experiments on five public benchmarks demonstrate
that SingularTrajectory substantially outperforms existing models, highlighting
its effectiveness in estimating general dynamics of human movements. Code is
publicly available at https://github.com/inhwanbae/SingularTrajectory .
comment: Accepted at CVPR 2024
☆ Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction CVPR 2024
Language models have demonstrated impressive ability in context understanding
and generative performance. Inspired by the recent success of language
foundation models, in this paper, we propose LMTraj (Language-based Multimodal
Trajectory predictor), which recasts the trajectory prediction task into a sort
of question-answering problem. Departing from traditional numerical regression
models, which treat the trajectory coordinate sequence as continuous signals,
we consider them as discrete signals like text prompts. Specifically, we first
transform an input space for the trajectory coordinate into the natural
language space. Here, the entire time-series trajectories of pedestrians are
converted into a text prompt, and scene images are described as text
information through image captioning. The transformed numerical and image data
are then wrapped into the question-answering template for use in a language
model. Next, to guide the language model in understanding and reasoning
high-level knowledge, such as scene context and social relationships between
pedestrians, we introduce auxiliary multi-task question answering. We
then train a numerical tokenizer with the prompt data. We encourage the
tokenizer to separate the integer and decimal parts well, and leverage it to
capture correlations between the consecutive numbers in the language model.
Lastly, we train the language model using the numerical tokenizer and all of
the question-answer prompts. Here, we propose a beam-search-based most-likely
prediction and a temperature-based multimodal prediction to implement both
deterministic and stochastic inferences. Applying our LMTraj, we show that the
language-based model can be a powerful pedestrian trajectory predictor, and
outperforms existing numerical-based predictor methods. Code is publicly
available at https://github.com/inhwanbae/LMTrajectory .
comment: Accepted at CVPR 2024
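The coordinate-to-text conversion described above might look like the sketch below. The prompt wording, the function names, and the two-decimal formatting are hypothetical and are not taken from the LMTrajectory repository; the sketch only illustrates wrapping numeric coordinates into a question and parsing coordinates back out of a generated answer.

```python
import re


def trajectory_to_prompt(pedestrian_id, coords, horizon, decimals=2):
    """Render an observed (x, y) trajectory as a question for a language model."""
    points = ", ".join(
        f"({x:.{decimals}f}, {y:.{decimals}f})" for x, y in coords
    )
    return (
        f"Pedestrian {pedestrian_id} moved along {points} over the last "
        f"{len(coords)} frames. What are the next {horizon} coordinates?"
    )


# Matches "(x, y)" pairs with optional sign and optional decimal part.
_PAIR = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def parse_answer(text):
    """Extract every (x, y) pair from a generated answer string."""
    return [(float(x), float(y)) for x, y in _PAIR.findall(text)]
```

Fixing the number of decimals keeps the integer and decimal parts in a predictable layout, which is in the spirit of the numerical tokenizer the abstract mentions.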
☆ $\mathrm{F^2Depth}$: Self-supervised Indoor Monocular Depth Estimation via Optical Flow Consistency and Feature Map Synthesis
Self-supervised monocular depth estimation methods have received increasing
attention because they do not require large, labelled
datasets. Such self-supervised methods require high-quality salient features
and consequently suffer from a severe performance drop in indoor scenes, where
the dominant low-textured regions are almost indiscriminative. To
address the issue, we propose a self-supervised indoor monocular depth
estimation framework called $\mathrm{F^2Depth}$. A self-supervised optical flow
estimation network is introduced to supervise depth learning. To improve
optical flow estimation performance in low-textured areas, only some patches of
points with more discriminative features are adopted for finetuning based on
our well-designed patch-based photometric loss. The finetuned optical flow
estimation network generates high-accuracy optical flow as a supervisory signal
for depth estimation. Correspondingly, an optical flow consistency loss is
designed. Multi-scale feature maps produced by the finetuned optical flow
estimation network are warped to compute a feature map synthesis loss as
another supervisory signal for depth learning. Experimental results on the NYU
Depth V2 dataset demonstrate the effectiveness of the framework and our
proposed losses. To evaluate the generalization ability of our
$\mathrm{F^2Depth}$, we collect a Campus Indoor depth dataset composed of
approximately 1500 points selected from 99 images in 18 scenes. Zero-shot
generalization experiments on the 7-Scenes dataset and Campus Indoor achieve
$\delta_1$ accuracy of 75.8% and 76.0% respectively. The accuracy results show
that our model can generalize well to monocular images captured in unknown
indoor scenes.
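The δ1 accuracy quoted above is the standard depth threshold metric: the fraction of points whose predicted-to-true depth ratio (taken in whichever direction is larger) stays below 1.25. A minimal sketch follows; the function name is illustrative and all depths are assumed to be strictly positive.

```python
def delta_accuracy(predicted, ground_truth, threshold=1.25):
    """Fraction of points with max(pred / gt, gt / pred) below the threshold."""
    hits = sum(
        1
        for p, g in zip(predicted, ground_truth)
        if max(p / g, g / p) < threshold
    )
    return hits / len(predicted)
```

Passing `threshold=1.25 ** 2` or `1.25 ** 3` gives the looser δ2 and δ3 variants often reported alongside δ1.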
☆ Backpropagation-free Network for 3D Test-time Adaptation CVPR 2024
Yanshuo Wang, Ali Cheraghian, Zeeshan Hayder, Jie Hong, Sameera Ramasinghe, Shafin Rahman, David Ahmedt-Aristizabal, Xuesong Li, Lars Petersson, Mehrtash Harandi
Real-world systems often encounter new data over time, which leads to
experiencing target domain shifts. Existing Test-Time Adaptation (TTA) methods
tend to apply computationally heavy and memory-intensive backpropagation-based
approaches to handle this. Here, we propose a novel method that uses a
backpropagation-free approach for TTA for the specific case of 3D data. Our
model uses a two-stream architecture to maintain knowledge about the source
domain as well as complementary target-domain-specific information. The
backpropagation-free property of our model helps address the well-known
forgetting problem and mitigates the error accumulation issue. The proposed
method also eliminates the need for the usually noisy process of
pseudo-labeling and reliance on costly self-supervised training. Moreover, our
method leverages subspace learning, effectively reducing the distribution
variance between the two domains. Furthermore, the source-domain-specific and
the target-domain-specific streams are aligned using a novel entropy-based
adaptive fusion strategy. Extensive experiments on popular benchmarks
demonstrate the effectiveness of our method. The code will be available at
https://github.com/abie-e/BFTT3D.
comment: CVPR 2024
☆ U-Sketch: An Efficient Approach for Sketch to Image Diffusion Models
Diffusion models have demonstrated remarkable performance in text-to-image
synthesis, producing realistic and high resolution images that faithfully
adhere to the corresponding text-prompts. Despite their great success, they
still fall behind in sketch-to-image synthesis tasks, where in addition to
text-prompts, the spatial layout of the generated images has to closely follow
the outlines of certain reference sketches. Employing an MLP latent edge
predictor to guide the spatial layout of the synthesized image by predicting
edge maps at each denoising step has been recently proposed. Despite yielding
promising results, the pixel-wise operation of the MLP does not take into
account the spatial layout as a whole, and demands numerous denoising
iterations to produce satisfactory images, leading to time inefficiency. To
this end, we introduce U-Sketch, a framework featuring a U-Net type latent edge
predictor, which is capable of efficiently capturing both local and global
features, as well as spatial correlations between pixels. Moreover, we propose
the addition of a sketch simplification network that offers the user the choice
of preprocessing and simplifying input sketches for enhanced outputs. The
experimental results, corroborated by user feedback, demonstrate that our
proposed U-Net latent edge predictor leads to more realistic results that are
better aligned with the spatial outlines of the reference sketches, while
drastically reducing the number of required denoising steps and, consequently,
the overall execution time.
☆ ECNet: Effective Controllable Text-to-Image Diffusion Models
Sicheng Li, Keqiang Sun, Zhixin Lai, Xiaoshi Wu, Feng Qiu, Haoran Xie, Kazunori Miyata, Hongsheng Li
The conditional text-to-image diffusion models have garnered significant
attention in recent years. However, the precision of these models is often
compromised for two main reasons: ambiguous condition input and inadequate
condition guidance from a single denoising loss. To address these challenges, we
introduce two innovative solutions. Firstly, we propose a Spatial Guidance
Injector (SGI) which enhances conditional detail by encoding text inputs with
precise annotation information. This method directly tackles the issue of
ambiguous control inputs by providing clear, annotated guidance to the model.
Secondly, to overcome the issue of limited conditional supervision, we
introduce Diffusion Consistency Loss (DCL), which applies supervision on the
denoised latent code at any given time step. This encourages consistency
between the latent code at each time step and the input signal, thereby
enhancing the robustness and accuracy of the output. The combination of SGI and
DCL results in our Effective Controllable Network (ECNet), which offers a more
accurate controllable end-to-end text-to-image generation framework with a more
precise conditioning input and stronger controllable supervision. We validate
our approach through extensive experiments on generation under various
conditions, such as human body skeletons, facial landmarks, and sketches of
general objects. The results consistently demonstrate that our method
significantly enhances the controllability and robustness of the generated
images, outperforming existing state-of-the-art controllable text-to-image
models.
☆ A Channel-ensemble Approach: Unbiased and Low-variance Pseudo-labels is Critical for Semi-supervised Classification
Semi-supervised learning (SSL) is a practical challenge in computer vision.
Pseudo-label (PL) methods, e.g., FixMatch and FreeMatch, achieve
state-of-the-art (SOTA) performance in SSL. These approaches employ a
threshold-to-pseudo-label (T2L) process to generate PLs by truncating the
confidence scores of unlabeled data predicted by the self-training method.
However, self-trained models typically yield biased and high-variance
predictions, especially when little labeled data is
supplied. To address this issue, we propose a lightweight channel-based
ensemble method to effectively consolidate multiple inferior PLs into one that
is theoretically guaranteed to be unbiased and low-variance. Importantly, our
approach can be readily extended to any SSL framework, such as FixMatch or
FreeMatch. Experimental results demonstrate that our method significantly
outperforms state-of-the-art techniques on CIFAR10/100 in terms of
effectiveness and efficiency.
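The threshold-to-pseudo-label (T2L) step can be sketched as follows. The plain averaging used to combine several predictions is only a stand-in assumption, since the abstract does not specify the paper's channel-based ensemble; the function names and the 0.95 threshold are likewise illustrative.

```python
def average_predictions(channel_probs):
    """Combine several class-probability vectors by a plain element-wise mean."""
    n = len(channel_probs)
    return [sum(column) / n for column in zip(*channel_probs)]


def threshold_to_pseudo_label(probs, threshold=0.95):
    """Return the argmax class if its confidence reaches the threshold, else None."""
    confidence = max(probs)
    if confidence < threshold:
        return None  # too uncertain: the sample gets no pseudo-label
    return probs.index(confidence)
```

Averaging independent predictions reduces variance, which is the intuition behind ensembling before thresholding; a weighted combination would be needed to also correct bias.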
☆ An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
Stimulated by the sophisticated reasoning capabilities of recent Large
Language Models (LLMs), a variety of strategies for bridging video modality
have been devised. A prominent strategy involves Video Language Models
(VideoLMs), which train a learnable interface with video data to connect
advanced vision encoders with LLMs. Recently, an alternative strategy has
surfaced, employing readily available foundation models, such as VideoLMs and
LLMs, across multiple stages for modality bridging. In this study, we introduce
a simple yet novel strategy where only a single Vision Language Model (VLM) is
utilized. Our starting point is the plain insight that a video comprises a
series of images, or frames, interwoven with temporal information. The essence
of video comprehension lies in adeptly managing the temporal aspects along with
the spatial details of each frame. Initially, we transform a video into a
single composite image by arranging multiple frames in a grid layout. The
resulting single image is termed an image grid. This format, while
maintaining the appearance of a solitary image, effectively retains temporal
information within the grid structure. Therefore, the image grid approach
enables direct application of a single high-performance VLM without
necessitating any video-data training. Our extensive experimental analysis
across ten zero-shot video question answering benchmarks, including five
open-ended and five multiple-choice benchmarks, reveals that the proposed Image
Grid Vision Language Model (IG-VLM) surpasses the existing methods in nine out
of ten benchmarks.
comment: Our code is available at https://github.com/imagegridworth/IG-VLM
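The grid construction described above can be sketched in plain Python. Frames are assumed to be equally sized 2D lists of pixel values, and the uniform frame sampling is an assumption made for illustration; neither function is taken from the IG-VLM repository.

```python
def sample_frame_indices(num_frames, num_samples):
    """Pick num_samples indices spread uniformly over num_frames frames."""
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def make_image_grid(frames, rows, cols):
    """Tile rows * cols frames into one image, in row-major (temporal) order."""
    if len(frames) != rows * cols:
        raise ValueError("expected exactly rows * cols frames")
    grid = []
    for r in range(rows):
        row_frames = frames[r * cols:(r + 1) * cols]
        for y in range(len(row_frames[0])):
            line = []
            for frame in row_frames:
                line.extend(frame[y])  # concatenate scanlines side by side
            grid.append(line)
    return grid
```

Because frames are placed left to right and top to bottom, reading order in the composite image matches temporal order in the video.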
☆ Colour and Brush Stroke Pattern Recognition in Abstract Art using Modified Deep Convolutional Generative Adversarial Networks
Abstract Art is an immensely popular and widely discussed form of art that often has
the ability to depict the emotions of an artist. Many researchers have made
attempts to study abstract art in the form of edge detection, brush stroke and
emotion recognition algorithms using machine and deep learning. This paper
describes the study of a wide distribution of abstract paintings using
Generative Adversarial Neural Networks (GANs). GANs have the ability to learn and
reproduce a distribution enabling researchers and scientists to effectively
explore and study the generated image space. However, the challenge lies in
developing an efficient GAN architecture that overcomes common training
pitfalls. This paper addresses this challenge by introducing a modified-DCGAN
(mDCGAN) specifically designed for high-quality artwork generation. The
approach involves a thorough exploration of the modifications made, delving
into the intricate workings of DCGANs, optimisation techniques, and
regularisation methods aimed at improving stability and realism in art
generation enabling effective study of generated patterns. The proposed mDCGAN
incorporates meticulous adjustments in layer configurations and architectural
choices, offering tailored solutions to the unique demands of art generation
while effectively combating issues like mode collapse and gradient vanishing.
Further, this paper explores the generated latent space by performing random
walks to understand vector relationships between brush strokes and colours in
the abstract art space, and presents a statistical analysis of unstable outputs
after a certain period of GAN training, comparing their significant differences. These
findings validate the effectiveness of the proposed approach, emphasising its
potential to revolutionise the field of digital art generation and the digital art
ecosystem.
comment: 28 pages, 5 tables, 7 figures
☆ FTBC: Forward Temporal Bias Correction for Optimizing ANN-SNN Conversion
Spiking Neural Networks (SNNs) offer a promising avenue for energy-efficient
computing compared with Artificial Neural Networks (ANNs), closely mirroring
biological neural processes. However, this potential comes with inherent
challenges in directly training SNNs through spatio-temporal backpropagation --
stemming from the temporal dynamics of spiking neurons and their discrete
signal processing -- which necessitates alternative ways of training, most
notably through ANN-SNN conversion. In this work, we introduce a lightweight
Forward Temporal Bias Correction (FTBC) technique, aimed at enhancing
conversion accuracy without the computational overhead. We ground our method on
theoretical findings showing that, through proper temporal bias calibration, the
expected error of ANN-SNN conversion can be reduced to zero after each time
step. We further propose a heuristic algorithm for finding the temporal bias
only in the forward pass, thus eliminating the computational burden of
backpropagation. We evaluate our method on the CIFAR-10/100 and ImageNet
datasets, achieving a notable increase in accuracy on all datasets. Code is
released in a GitHub repository.
☆ Generative Multi-modal Models are Good Class-Incremental Learners CVPR 2024
In class-incremental learning (CIL) scenarios, the phenomenon of catastrophic
forgetting caused by the classifier's bias towards the current task has long
posed a significant challenge. It is mainly caused by the characteristics of
discriminative models. With the growing popularity of generative
multi-modal models, we explore replacing discriminative models with
generative ones for CIL. However, transitioning from discriminative to
generative models requires addressing two key challenges. The primary challenge
lies in transferring the generated textual information into the classification
of distinct categories. Additionally, it requires formulating the task of CIL
within a generative framework. To this end, we propose a novel generative
multi-modal model (GMM) framework for class-incremental learning. Our approach
directly generates labels for images using an adapted generative model. After
obtaining the detailed text, we use a text encoder to extract text features and
employ feature matching to determine the most similar label as the
classification prediction. In the conventional CIL settings, we achieve
significantly better results in long-sequence task scenarios. Under the
Few-shot CIL setting, we have improved by at least 14% accuracy over all the
current state-of-the-art methods with significantly less forgetting. Our code
is available at https://github.com/DoubleClass/GMM.
comment: Accepted at CVPR 2024
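The feature-matching step mentioned above can be illustrated with cosine similarity between the feature of the generated text and the features of the candidate labels. Cosine similarity and the function names are assumptions made for this sketch; the abstract does not state which similarity measure the GMM framework uses, and the feature vectors would come from a text encoder that is not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_label(generated_feature, label_features):
    """Return the label whose feature is most similar to the generated text's.

    label_features: dict mapping label name -> feature vector
    """
    return max(
        label_features,
        key=lambda name: cosine_similarity(generated_feature, label_features[name]),
    )
```

Matching in feature space means the generated text does not have to reproduce a label name verbatim to be classified correctly.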
☆ BAM: Box Abstraction Monitors for Real-time OoD Detection in Object Detection
Out-of-distribution (OoD) detection techniques for deep neural networks
(DNNs) have become crucial thanks to their filtering of abnormal inputs, especially
when DNNs are used in safety-critical applications and interact with an open
and dynamic environment. Nevertheless, integrating OoD detection into
state-of-the-art (SOTA) object detection DNNs poses significant challenges,
partly due to the complexity introduced by the SOTA OoD construction methods,
which require the modification of DNN architecture and the introduction of
complex loss functions. This paper proposes a simple, yet surprisingly
effective, method that requires neither retraining nor architectural change in
object detection DNN, called Box Abstraction-based Monitors (BAM). The novelty
of BAM stems from using a finite union of convex box abstractions to capture
the learned features of objects for in-distribution (ID) data, and an important
observation that features from OoD data are more likely to fall outside of
these boxes. The union of convex regions within the feature space allows the
formation of non-convex and interpretable decision boundaries, overcoming the
limitations of VOS-like detectors without sacrificing real-time performance.
Experiments integrating BAM into Faster R-CNN-based object detection DNNs
demonstrate a considerably improved performance against SOTA OoD detection
techniques.
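The monitoring idea in the abstract above (a finite union of boxes around
in-distribution features, with features outside every box flagged as OoD) can
be sketched as follows. This is a minimal sketch under stated assumptions: the
features are assumed to be already grouped into clusters, and the `margin`
parameter is hypothetical. How the paper clusters features or sizes its boxes
is not specified here.

```python
def build_box(features):
    """Tightest axis-aligned box (per-dimension min and max) around features."""
    dims = range(len(features[0]))
    lows = [min(f[d] for f in features) for d in dims]
    highs = [max(f[d] for f in features) for d in dims]
    return lows, highs

def build_monitor(clusters):
    """One box per cluster of in-distribution features; the monitor is their union."""
    return [build_box(cluster) for cluster in clusters]

def is_ood(feature, boxes, margin=0.0):
    """Flag a feature as OoD if it lies outside every box (enlarged by margin)."""
    for lows, highs in boxes:
        if all(lo - margin <= x <= hi + margin
               for x, lo, hi in zip(feature, lows, highs)):
            return False
    return True
```

Each box is convex, but their union need not be: a point lying between two
boxes is rejected even though it sits inside the single box that would enclose
all the training features.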
☆ Ship in Sight: Diffusion Models for Ship-Image Super Resolution IJCNN
In recent years, remarkable advancements have been achieved in the field of
image generation, primarily driven by the escalating demand for high-quality
outcomes across various image generation subtasks, such as inpainting,
denoising, and super resolution. A major effort is devoted to exploring the
application of super-resolution techniques to enhance the quality of
low-resolution images. In this context, our method explores in depth the
problem of ship image super resolution, which is crucial for coastal and port
surveillance. We investigate the opportunity given by the growing interest in
text-to-image diffusion models, taking advantage of the prior knowledge that
such foundation models have already learned. In particular, we present a
diffusion-model-based architecture that leverages text conditioning during
training while being class-aware, to best preserve the crucial details of the
ships during the generation of the super-resolved image. Given the specificity
of this task and the scarcity of off-the-shelf data, we also
introduce a large labeled ship dataset scraped from online ship images, mostly
from the ShipSpotting website (www.shipspotting.com). Our method
achieves more robust results than other deep learning models previously
employed for super resolution, as proven by the multiple experiments performed.
Moreover, we investigate how this model can benefit downstream tasks, such as
classification and object detection, thus emphasizing practical implementation
in a real-world scenario. Experimental results show flexibility, reliability,
and impressive performance of the proposed framework over state-of-the-art
methods for different tasks. The code is available at:
https://github.com/LuigiSigillo/ShipinSight .
comment: Accepted at 2024 International Joint Conference on Neural Networks
(IJCNN)
☆ ViTAR: Vision Transformer with Any Resolution
This paper tackles a significant challenge faced by Vision Transformers
(ViTs): their constrained scalability across different image resolutions.
Typically, ViTs experience a performance decline when processing resolutions
different from those seen during training. Our work introduces two key
innovations to address this issue. Firstly, we propose a novel module for
dynamic resolution adjustment, designed with a single Transformer block,
specifically to achieve highly efficient incremental token integration.
Secondly, we introduce fuzzy positional encoding in the Vision Transformer to
provide consistent positional awareness across multiple resolutions, thereby
preventing overfitting to any single training resolution. Our resulting model,
ViTAR (Vision Transformer with Any Resolution), demonstrates impressive
adaptability, achieving 83.3\% top-1 accuracy at a 1120x1120 resolution and
80.4\% accuracy at a 4032x4032 resolution, all while reducing computational
costs. ViTAR also shows strong performance in downstream tasks such as instance
and semantic segmentation and can easily be combined with self-supervised learning
techniques like Masked AutoEncoder. Our work provides a cost-effective solution
for enhancing the resolution scalability of ViTs, paving the way for more
versatile and efficient high-resolution image processing.
☆ Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation
Most domain adaptation (DA) methods are based on either convolutional
neural networks (CNNs) or vision transformers (ViTs). They align the
distribution differences between domains as encoders without considering their
unique characteristics. For instance, ViT excels in accuracy due to its
superior ability to capture global representations, while CNN has an advantage
in capturing local representations. This fact has led us to design a hybrid
method to fully take advantage of both ViT and CNN, called Explicitly
Class-specific Boundaries (ECB). ECB learns CNN on ViT to combine their
distinct strengths. In particular, we leverage ViT's properties to explicitly
find class-specific decision boundaries by maximizing the discrepancy between
the outputs of the two classifiers to detect target samples far from the source
support. In contrast, the CNN encoder clusters target features based on the
previously defined class-specific boundaries by minimizing the discrepancy
between the probabilities of the two classifiers. Finally, ViT and CNN mutually
exchange knowledge to improve the quality of pseudo labels and reduce the
knowledge discrepancies of these models. Compared to conventional DA methods,
our ECB achieves superior performance, which verifies its effectiveness in this
hybrid model. The project website can be found at
https://dotrannhattuong.github.io/ECB/website/.
☆ MonoHair: High-Fidelity Hair Modeling from a Monocular Video CVPR 2024
Keyu Wu, Lingchen Yang, Zhiyi Kuang, Yao Feng, Xutao Han, Yuefan Shen, Hongbo Fu, Kun Zhou, Youyi Zheng
Undoubtedly, high-fidelity 3D hair is crucial for achieving realism, artistic
expression, and immersion in computer graphics. While existing 3D hair modeling
methods have achieved impressive performance, the challenge of achieving
high-quality hair reconstruction persists: they either require strict capture
conditions, making practical applications difficult, or heavily rely on learned
prior data, obscuring fine-grained details in images. To address these
challenges, we propose MonoHair, a generic framework to achieve high-fidelity
hair reconstruction from a monocular video, without specific requirements for
environments. Our approach bifurcates the hair modeling process into two main
stages: precise exterior reconstruction and interior structure inference. The
exterior is meticulously crafted using our Patch-based Multi-View Optimization
(PMVO). This method strategically collects and integrates hair information from
multiple views, independent of prior data, to produce a high-fidelity exterior
3D line map. This map not only captures intricate details but also facilitates
the inference of the hair's inner structure. For the interior, we employ a
data-driven, multi-view 3D hair reconstruction method. This method utilizes 2D
structural renderings derived from the reconstructed exterior, mirroring the
synthetic 2D inputs used during training. This alignment effectively bridges
the domain gap between our training data and real-world data, thereby enhancing
the accuracy and reliability of our interior structure inference. Lastly, we
generate a strand model and resolve the directional ambiguity by our hair
growth algorithm. Our experiments demonstrate that our method exhibits
robustness across diverse hairstyles and achieves state-of-the-art performance.
For more results, please refer to our project page
https://keyuwu-cs.github.io/MonoHair/.
comment: Accepted by IEEE CVPR 2024
☆ Generating Diverse Agricultural Data for Vision-Based Farming Applications
Mikolaj Cieslak, Umabharathi Govindarajan, Alejandro Garcia, Anuradha Chandrashekar, Torsten Hädrich, Aleksander Mendoza-Drosik, Dominik L. Michels, Sören Pirk, Chia-Chun Fu, Wojciech Pałubicki
We present a specialized procedural model for generating synthetic
agricultural scenes, focusing on soybean crops, along with various weeds. This
model is capable of simulating distinct growth stages of these plants, diverse
soil conditions, and randomized field arrangements under varying lighting
conditions. The integration of real-world textures and environmental factors
into the procedural generation process enhances the photorealism and
applicability of the synthetic data. Our dataset includes 12,000 images with
semantic labels, offering a comprehensive resource for computer vision tasks in
precision agriculture, such as semantic segmentation for autonomous weed
control. We validate our model's effectiveness by comparing the synthetic data
against real agricultural images, demonstrating its potential to significantly
augment training data for machine learning models in agriculture. This approach
not only provides a cost-effective solution for generating high-quality,
diverse data but also addresses specific needs in agricultural vision tasks
that are not fully covered by general-purpose models.
comment: 10 pages, 8 figures, 3 tables
☆ A Quantum Fuzzy-based Approach for Real-Time Detection of Solar Coronal Holes
The detection and analysis of the solar coronal holes (CHs) is an important
field of study in the domain of solar physics. Mainly, it is required for the
proper prediction of the geomagnetic storms which directly or indirectly affect
various space and ground-based systems. To date, solar scientists have depended
on manual hand-drawn approaches for the detection of CHs. However, with the
advancement of image processing technologies, some automated image segmentation
methods have been used for the detection of CHs. In spite of this, fast and
accurate detection of CHs is still a major issue. In this work, a novel
quantum computing-based fast fuzzy c-mean technique has been developed for fast
detection of the CH regions. The task has been carried out in two stages: in
the first stage, the solar image is segmented using a quantum computing-based
fast fuzzy c-mean (QCFFCM) technique, and in the second stage, the CHs are
extracted from the segmented image using image morphological operations. In the
work, quantum computing has been used to optimize the cost function of the fast
fuzzy c-mean (FFCM) algorithm, where quantum approximate optimization algorithm
(QAOA) has been used to optimize the quadratic part of the cost function. The
proposed method has been tested on 193 Å SDO/AIA full-disk solar image
datasets and has been compared with the existing techniques. The results show
that the proposed method achieves performance comparable to the existing ones
in much less time.
comment: 14 pages, 5 figures, 3 tables
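For orientation, the classical fuzzy c-means update that the segmentation
stage above builds on can be sketched as follows. This is the standard
(non-quantum) algorithm applied to a flat list of pixel intensities. It does
not reproduce the paper's fast variant or its QAOA-based optimization, and the
parameter defaults are assumptions.

```python
def fuzzy_c_means(values, centers, m=2.0, iters=50, eps=1e-9):
    """Classical fuzzy c-means on scalar values (e.g. pixel intensities).

    Returns the final centers and the membership rows computed in the
    last iteration (one row per value, one column per cluster).
    """
    centers = list(centers)
    power = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iters):
        memberships = []
        for x in values:
            # eps keeps the distance non-zero when x coincides with a center
            dists = [abs(x - c) + eps for c in centers]
            row = [1.0 / sum((d_i / d_j) ** power for d_j in dists)
                   for d_i in dists]
            memberships.append(row)
        for i in range(len(centers)):
            weights = [row[i] ** m for row in memberships]
            centers[i] = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    return centers, memberships
```

Each pixel receives a soft membership in every cluster. A hard segmentation can
then be obtained by assigning each pixel to its highest-membership cluster
before applying morphological operations.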
☆ Quantifying and Mitigating Unimodal Biases in Multimodal Large Language Models: A Causal Perspective
Recent advancements in Large Language Models (LLMs) have facilitated the
development of Multimodal LLMs (MLLMs). Despite their impressive capabilities,
MLLMs often suffer from an over-reliance on unimodal biases (e.g., language
bias and vision bias), leading to incorrect answers in complex multimodal
tasks. To investigate this issue, we propose a causal framework to interpret
the biases in Visual Question Answering (VQA) problems. Within our framework,
we devise a causal graph to elucidate the predictions of MLLMs on VQA problems,
and assess the causal effect of biases through an in-depth causal analysis.
Motivated by the causal graph, we introduce a novel MORE dataset, consisting of
12,000 VQA instances. This dataset is designed to challenge MLLMs' abilities,
necessitating multi-hop reasoning and the surmounting of unimodal biases.
Furthermore, we propose two strategies to mitigate unimodal biases and enhance
MLLMs' reasoning capabilities, including a Decompose-Verify-Answer (DeVA)
framework for limited-access MLLMs and the refinement of open-source MLLMs
through fine-tuning. Extensive quantitative and qualitative experiments offer
valuable insights for future research.
☆ Learning Inclusion Matching for Animation Paint Bucket Colorization CVPR 2024
Colorizing line art is a pivotal task in the production of hand-drawn cel
animation. This typically involves digital painters using a paint bucket tool
to manually color each segment enclosed by lines, based on RGB values
predetermined by a color designer. This frame-by-frame process is both arduous
and time-intensive. Current automated methods mainly focus on segment matching.
This technique migrates colors from a reference to the target frame by aligning
features within line-enclosed segments across frames. However, issues like
occlusion and wrinkles in animations often disrupt these direct
correspondences, leading to mismatches. In this work, we introduce a new
learning-based inclusion matching pipeline, which directs the network to
comprehend the inclusion relationships between segments rather than relying
solely on direct visual correspondences. Our method features a two-stage
pipeline that integrates a coarse color warping module with an inclusion
matching module, enabling more nuanced and accurate colorization. To facilitate
the training of our network, we also develop a unique dataset, referred to as
PaintBucket-Character. This dataset includes rendered line arts alongside their
colorized counterparts, featuring various 3D characters. Extensive experiments
demonstrate the effectiveness and superiority of our method over existing
techniques.
comment: accepted to CVPR 2024. Project Page:
https://ykdai.github.io/projects/InclusionMatching
☆ H2ASeg: Hierarchical Adaptive Interaction and Weighting Network for Tumor Segmentation in PET/CT Images
Positron emission tomography (PET) combined with computed tomography (CT)
imaging is routinely used in cancer diagnosis and prognosis by providing
complementary information. Automatically segmenting tumors in PET/CT images can
significantly improve examination efficiency. Traditional multi-modal
segmentation solutions mainly rely on concatenation operations for modality
fusion, which fail to effectively model the non-linear dependencies between PET
and CT modalities. Recent studies have investigated various approaches to
optimize the fusion of modality-specific features for enhancing joint
representations. However, modality-specific encoders used in these methods
operate independently, inadequately leveraging the synergistic relationships
inherent in PET and CT modalities, for example, the complementarity between
semantics and structure. To address these issues, we propose a Hierarchical
Adaptive Interaction and Weighting Network termed H2ASeg to explore the
intrinsic cross-modal correlations and transfer potential complementary
information. Specifically, we design a Modality-Cooperative Spatial Attention
(MCSA) module that performs intra- and inter-modal interactions globally and
locally. Additionally, a Target-Aware Modality Weighting (TAMW) module is
developed to highlight tumor-related features within multi-modal features,
thereby refining tumor segmentation. By embedding these modules across
different layers, H2ASeg can hierarchically model cross-modal correlations,
enabling a nuanced understanding of both semantic and structural tumor
features. Extensive experiments demonstrate the superiority of H2ASeg,
outperforming state-of-the-art methods on AutoPet-II and Hecktor2022
benchmarks. The code is released at https://github.com/G14nTDo4/H2ASeg.
comment: 10 pages, 4 figures
☆ DODA: Diffusion for Object-detection Domain Adaptation in Agriculture
The diverse and high-quality content generated by recent generative models
demonstrates the great potential of using synthetic data to train downstream
models. However, in vision, especially in object detection, related areas
are not fully explored: synthetic images are merely used to balance the
long tails of existing datasets, and the accuracy of the generated labels is
low, so the full potential of generative models has not been exploited. In this
paper, we propose DODA, a data synthesizer that can generate high-quality
object detection data for new domains in agriculture. Specifically, we improve
the controllability of layout-to-image generation by encoding the layout as an
image, thereby improving the quality of labels, and use a visual encoder to
provide visual clues for the diffusion model, decoupling visual features from
the diffusion model and giving the model the ability to generate data in new
domains. On the Global Wheat Head Detection (GWHD) Dataset, which is the
largest dataset in agriculture and contains diverse domains, using the data
synthesized by DODA improves the performance of the object detector by
12.74-17.76 AP$_{50}$ in the domain that was significantly shifted from the
training data.
☆ Tracking-Assisted Object Detection with Event Cameras
Ting-Kang Yen, Igor Morawski, Shusil Dangi, Kai He, Chung-Yi Lin, Jia-Fong Yeh, Hung-Ting Su, Winston Hsu
Event-based object detection has recently garnered attention in the computer
vision community due to the exceptional properties of event cameras, such as
high dynamic range and no motion blur. However, feature asynchronism and
sparsity make objects with no relative motion to the camera invisible,
posing a significant challenge in the task. Prior works have studied various
memory mechanisms to preserve as many features as possible at the current time,
guided by temporal clues. While these implicit-learned memories retain some
short-term information, they still struggle to preserve long-term features
effectively. In this paper, we consider those invisible objects as
pseudo-occluded objects and aim to reveal their features. Firstly, we introduce
a visibility attribute of objects and contribute an auto-labeling algorithm to
append additional visibility labels to an existing event camera dataset.
Secondly, we exploit tracking strategies for pseudo-occluded objects to
maintain their permanence and retain their bounding boxes, even when features
have not been available for a very long time. These strategies can be treated
as an explicit-learned memory guided by the tracking objective to record the
displacements of objects across frames. Lastly, we propose a spatio-temporal
feature aggregation module to enrich the latent features and a consistency loss
to increase the robustness of the overall pipeline. We conduct comprehensive
experiments to verify our method's effectiveness where still objects are
retained but real occluded objects are discarded. The results demonstrate that
(1) the additional visibility labels can assist in supervised training, and (2)
our method outperforms state-of-the-art approaches with a significant
improvement of 7.9% absolute mAP.
☆ PIPNet3D: Interpretable Detection of Alzheimer in MRI Scans
Lisa Anita De Santi, Jörg Schlötterer, Michael Scheschenja, Joel Wessendorf, Meike Nauta, Vincenzo Positano, Christin Seifert
Information from neuroimaging examinations (CT, MRI) is increasingly used to
support diagnoses of dementia, e.g., Alzheimer's disease. While current
clinical practice is mainly based on visual inspection and feature engineering,
Deep Learning approaches can be used to automate the analysis and to discover
new image-biomarkers. Part-prototype neural networks (PP-NN) are an alternative
to standard blackbox models, and have shown promising results in general
computer vision. PP-NNs base their reasoning on prototypical image regions
that are learned fully unsupervised, and combined with a simple-to-understand
decision layer. We present PIPNet3D, a PP-NN for volumetric images. We apply
PIPNet3D to the clinical case study of Alzheimer's Disease diagnosis from
structural Magnetic Resonance Imaging (sMRI). We assess the quality of
prototypes under a systematic evaluation framework, propose new metrics to
evaluate brain prototypes and perform an evaluation with domain experts. Our
results show that PIPNet3D is an interpretable, compact model for Alzheimer's
diagnosis with its reasoning well aligned to medical domain knowledge. Notably,
PIPNet3D achieves the same accuracy as its blackbox counterpart; and removing
the remaining clinically irrelevant prototypes from its decision process does
not decrease predictive performance.
☆ Implementation of the Principal Component Analysis onto High-Performance Computer Facilities for Hyperspectral Dimensionality Reduction: Results and Comparisons
E. Martel, R. Lazcano, J. Lopez, D. Madroñal, R. Salvador, S. Lopez, E. Juarez, R. Guerra, C. Sanz, R. Sarmiento
Dimensionality reduction represents a critical preprocessing step in order to
increase the efficiency and the performance of many hyperspectral imaging
algorithms. However, dimensionality reduction algorithms, such as the Principal
Component Analysis (PCA), are computationally demanding, which makes it
advisable to implement them on high-performance computer
architectures for applications under strict latency constraints. This work
presents the implementation of the PCA algorithm onto two different
high-performance devices, namely, an NVIDIA Graphics Processing Unit (GPU) and
a Kalray manycore, uncovering a highly valuable set of tips and tricks in order
to take full advantage of the inherent parallelism of these high-performance
computing platforms, and hence, reducing the time that is required to process a
given hyperspectral image. Moreover, the achieved results obtained with
different hyperspectral images have been compared with the ones that were
obtained with a field programmable gate array (FPGA)-based implementation of
the PCA algorithm that has been recently published, providing, for the first
time in the literature, a comprehensive analysis in order to highlight the pros
and cons of each option.
comment: 30 pages, 10 figures
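The stages of PCA that such implementations have to parallelize (band-wise
mean removal, covariance computation, eigenvector extraction, and projection)
can be sketched in a few lines. This is a single-threaded illustration only,
not the GPU or manycore implementation the paper describes. Power iteration is
a swapped-in technique assumed here for the eigenvector step, and it extracts
only the first principal component.

```python
import math

def covariance(pixels):
    """Mean-center pixel spectra and return (covariance matrix, centered data)."""
    n, bands = len(pixels), len(pixels[0])
    mean = [sum(p[b] for p in pixels) / n for b in range(bands)]
    centered = [[p[b] - mean[b] for b in range(bands)] for p in pixels]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(bands)]
           for i in range(bands)]
    return cov, centered

def top_component(cov, iters=200):
    """Dominant eigenvector of cov by power iteration.

    Assumes the all-ones start vector is not orthogonal to that eigenvector.
    """
    v = [1.0] * len(cov)
    for _ in range(iters):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(centered, component):
    """Reduce each pixel spectrum to its score along one component."""
    return [sum(a * b for a, b in zip(row, component)) for row in centered]
```

The covariance accumulation and the projection are independent across band
pairs and across pixels respectively, which is what makes these stages natural
candidates for parallel hardware.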
☆ Uncertainty-Aware SAR ATR: Defending Against Adversarial Attacks via Bayesian Neural Networks
Adversarial attacks have demonstrated the vulnerability of Machine Learning
(ML) image classifiers in Synthetic Aperture Radar (SAR) Automatic Target
Recognition (ATR) systems. An adversarial attack can deceive the classifier
into making incorrect predictions by perturbing the input SAR images, for
example, with a few scatterers attached to the on-ground objects. Therefore, it
is critical to develop robust SAR ATR systems that can detect potential
adversarial attacks by leveraging the inherent uncertainty in ML classifiers,
thereby effectively alerting human decision-makers. In this paper, we propose a
novel uncertainty-aware SAR ATR for detecting adversarial attacks.
Specifically, we leverage the capability of Bayesian Neural Networks (BNNs) in
performing image classification with quantified epistemic uncertainty to
measure the confidence for each input SAR image. By evaluating the uncertainty,
our method alerts when the input SAR image is likely to be adversarially
generated. Simultaneously, we also generate visual explanations that reveal the
specific regions in the SAR image where the adversarial scatterers are likely
to be present, thus aiding human decision-making with hints of evidence of
adversarial attacks. Experiments on the MSTAR dataset demonstrate that our
approach can identify over 80% of adversarial SAR images with fewer than 20% false
alarms, and our visual explanations can identify up to over 90% of scatterers
in an adversarial SAR image.
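One common way to quantify epistemic uncertainty from a Bayesian classifier is
the mutual information between the prediction and the sampled weights,
estimated from several stochastic forward passes. The sketch below assumes that
measure and a hypothetical alert threshold. The abstract above does not state
which uncertainty measure or threshold the authors use.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def epistemic_uncertainty(sampled_probs):
    """Mutual-information estimate: H(mean prediction) - mean H(prediction).

    sampled_probs holds one class-probability vector per sampled set of weights.
    """
    n = len(sampled_probs)
    num_classes = len(sampled_probs[0])
    mean = [sum(p[c] for p in sampled_probs) / n for c in range(num_classes)]
    return entropy(mean) - sum(entropy(p) for p in sampled_probs) / n

def is_suspicious(sampled_probs, threshold=0.1):
    """Raise an alert when the sampled predictions disagree strongly."""
    return epistemic_uncertainty(sampled_probs) > threshold
```

When all sampled networks agree, the two entropy terms cancel and the score is
near zero. Strong disagreement between samples drives the score up and triggers
the alert.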
☆ Selective Mixup Fine-Tuning for Optimizing Non-Decomposable Objectives ICLR 2024
Shrinivas Ramasubramanian, Harsh Rangwani, Sho Takemori, Kunal Samanta, Yuhei Umeda, Venkatesh Babu Radhakrishnan
The rise in internet usage has led to the generation of massive amounts of
data, resulting in the adoption of various supervised and semi-supervised
machine learning algorithms, which can effectively utilize the colossal amount
of data to train models. However, before deploying these models in the real
world, they must be strictly evaluated on performance measures like worst-case
recall and satisfy constraints such as fairness. We find that current
state-of-the-art empirical techniques offer sub-optimal performance on these
practical, non-decomposable performance objectives. On the other hand, the
theoretical techniques necessitate training a new model from scratch for each
performance objective. To bridge the gap, we propose SelMix, a selective
mixup-based inexpensive fine-tuning technique for pre-trained models, to
optimize for the desired objective. The core idea of our framework is to
determine a sampling distribution to perform a mixup of features between
samples from particular classes such that it optimizes the given objective. We
comprehensively evaluate our technique against the existing empirical and
theoretically principled methods on standard benchmark datasets for imbalanced
classification. We find that the proposed SelMix fine-tuning significantly improves
the performance for various practical non-decomposable objectives across
benchmarks.
comment: ICLR 2024 SpotLight
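The core operation described in the abstract above, sampling a class pair from
a distribution and mixing features from those two classes, can be sketched as
follows. This is an illustration under assumptions: the pair distribution
`pair_probs` is taken as given (how SelMix derives it from the target objective
is not shown), and the fixed mixing coefficient `lam` is hypothetical.

```python
import random

def mixup(feat_a, feat_b, lam):
    """Convex combination of two feature vectors."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(feat_a, feat_b)]

def sample_mixed_feature(features_by_class, pair_probs, rng, lam=0.6):
    """Draw a class pair according to pair_probs, then mix one feature from each."""
    pairs = list(pair_probs)
    weights = [pair_probs[pair] for pair in pairs]
    class_i, class_j = rng.choices(pairs, weights=weights, k=1)[0]
    feat_i = rng.choice(features_by_class[class_i])
    feat_j = rng.choice(features_by_class[class_j])
    return (class_i, class_j), mixup(feat_i, feat_j, lam)
```

Putting more probability mass on pairs that involve poorly performing classes
steers fine-tuning toward the objective of interest, for example worst-case
recall.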
☆ Multi-scale Unified Network for Image Classification
Convolutional Neural Networks (CNNs) have advanced significantly in visual
representation learning and recognition. However, they face notable challenges
in performance and computational efficiency when dealing with real-world,
multi-scale image inputs. Conventional methods rescale all input images into a
fixed size, wherein a larger fixed size favors performance but rescaling small
size images to a larger size incurs digitization noise and increased
computation cost. In this work, we carry out a comprehensive, layer-wise
investigation of CNN models in response to scale variation, based on Centered
Kernel Alignment (CKA) analysis. The observations reveal lower layers are more
sensitive to input image scale variations than high-level layers. Inspired by
this insight, we propose Multi-scale Unified Network (MSUN) consisting of
multi-scale subnets, a unified network, and a scale-invariant constraint. Our
method divides the shallow layers into multi-scale subnets to enable feature
extraction from multi-scale inputs, and the low-level features are unified in
deep layers for extracting high-level semantic features. A scale-invariant
constraint is posed to maintain feature consistency across different scales.
Extensive experiments on ImageNet and other scale-diverse datasets demonstrate
that MSUN achieves significant improvements in both model performance and
computational efficiency. Particularly, MSUN yields an accuracy increase up to
44.53% and diminishes FLOPs by 7.01-16.13% in multi-scale scenarios.
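The layer-wise analysis mentioned in the abstract above relies on Centered
Kernel Alignment. A minimal sketch of linear CKA between two sets of features
computed for the same examples follows. The authors may use a different kernel
or estimator, so this shows only the standard linear form.

```python
import math

def center(mat):
    """Subtract the column mean from every column of a row-major matrix."""
    n = len(mat)
    means = [sum(row[j] for row in mat) / n for j in range(len(mat[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in mat]

def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b, with examples stored as rows."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            total += sum(ra[i] * rb[j] for ra, rb in zip(a, b)) ** 2
    return total

def linear_cka(x, y):
    """Linear CKA similarity between two feature matrices over the same examples."""
    x, y = center(x), center(y)
    return cross_frobenius_sq(x, y) / math.sqrt(
        cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y))
```

Comparing the features a layer produces for the same images at two input
scales gives a score in [0, 1]. Lower scores indicate that the layer is more
sensitive to the change of scale.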
☆ Efficient Test-Time Adaptation of Vision-Language Models CVPR 2024
Test-time adaptation with pre-trained vision-language models has attracted
increasing attention for tackling distribution shifts during the test time.
Though prior studies have achieved very promising performance, they involve
intensive computation, which is severely misaligned with test-time adaptation. We
design TDA, a training-free dynamic adapter that enables effective and
efficient test-time adaptation with vision-language models. TDA works with a
lightweight key-value cache that maintains a dynamic queue with few-shot pseudo
labels as values and the corresponding test-sample features as keys. Leveraging
the key-value cache, TDA allows adapting to test data gradually via progressive
pseudo label refinement which is super-efficient without incurring any
backpropagation. In addition, we introduce negative pseudo labeling that
alleviates the adverse impact of pseudo label noise by assigning pseudo labels
to certain negative classes when the model is uncertain about its pseudo label
predictions. Extensive experiments over two benchmarks demonstrate TDA's
superior effectiveness and efficiency as compared with the state-of-the-art.
The code has been released at \url{https://kdiaaa.github.io/tda/}.
comment: Accepted to CVPR 2024. The code has been released in
\url{https://kdiaaa.github.io/tda/}
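The key-value cache described in the abstract above (test-sample features as
keys, pseudo labels as values, a bounded queue per class) can be sketched as
follows. Keeping the lowest-entropy entries and scoring by dot-product
similarity are assumptions made for this illustration. The class and method
names are hypothetical and are not taken from the released code.

```python
import math

class PseudoLabelCache:
    """Per-class bounded queues of (entropy, feature) pairs, updated without gradients."""

    def __init__(self, shots_per_class=2):
        self.shots = shots_per_class
        self.slots = {}  # pseudo label -> list of (entropy, feature key)

    def update(self, feature, probs):
        """Insert a test feature under its pseudo label, keeping confident entries."""
        label = max(range(len(probs)), key=probs.__getitem__)
        ent = -sum(p * math.log(p) for p in probs if p > 0)
        queue = self.slots.setdefault(label, [])
        queue.append((ent, feature))
        queue.sort(key=lambda item: item[0])  # lowest entropy first
        del queue[self.shots:]                # drop the least confident overflow
        return label

    def cache_logits(self, feature, num_classes):
        """Similarity-weighted votes from cached keys for each class."""
        logits = [0.0] * num_classes
        for label, queue in self.slots.items():
            for _, key in queue:
                logits[label] += sum(a * b for a, b in zip(feature, key))
        return logits
```

The cache logits would typically be added to the model's own zero-shot logits.
Only queue insertions and dot products are involved, so adaptation needs no
backpropagation.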
☆ Towards Non-Exemplar Semi-Supervised Class-Incremental Learning
Deep neural networks perform remarkably well in closed-world scenarios.
However, novel classes emerge continually in real applications, making it
necessary to learn incrementally. Class-incremental learning (CIL) aims to
gradually recognize new classes while maintaining the discriminability of old
ones. Existing CIL methods have two limitations: a heavy reliance on preserving
old data for forgetting mitigation and the need for vast labeled data for
knowledge adaptation. To overcome these issues, we propose a non-exemplar
semi-supervised CIL framework with contrastive learning and semi-supervised
incremental prototype classifier (Semi-IPC). On the one hand, contrastive
learning helps the model learn rich representations, easing the trade-off
between learning representations of new classes and forgetting those of old
classes. On the other hand, Semi-IPC learns a prototype for each class with
unsupervised regularization, enabling the model to incrementally learn from
partially labeled new data while maintaining the knowledge of old classes.
Experiments on benchmark datasets demonstrate the strong performance of our
method: without storing any old samples and only using less than 1% of labels,
Semi-IPC outperforms advanced exemplar-based methods. We hope our work offers
new insights for future CIL research. The code will be made publicly available.
☆ SGDM: Static-Guided Dynamic Module Make Stronger Visual Models
The spatial attention mechanism has been widely used to improve object
detection performance. However, its operation is currently limited to static
convolutions lacking content-adaptive features. This paper innovatively
approaches from the perspective of dynamic convolution. We propose Razor
Dynamic Convolution (RDConv) to address the two flaws in dynamic weight
convolution that make it hard to implement in spatial mechanisms: 1) it is
computation-heavy; 2) when generating weights, spatial information is
disregarded. Firstly, by using Razor Operation to generate certain features, we
vastly reduce the parameters of the entire dynamic convolution operation.
Secondly, we add a spatial branch inside RDConv to generate convolutional
kernel parameters with richer spatial information. Embedding dynamic
convolution will also bring the problem of sensitivity to high-frequency noise.
We propose the Static-Guided Dynamic Module (SGDM) to address this limitation.
By using SGDM, we utilize a set of asymmetric static convolution kernel
parameters to guide the construction of dynamic convolution. We introduce the
mechanism of shared weights in static convolution to solve the problem of
dynamic convolution being sensitive to high-frequency noise. Extensive
experiments illustrate that multiple different object detection backbones
equipped with SGDM achieve a highly competitive boost in performance (e.g., +4%
mAP with YOLOv5n on VOC and +1.7% mAP with YOLOv8n on COCO) with negligible
parameter increase (i.e., +0.33M on YOLOv5n and +0.19M on YOLOv8n).
comment: 16 pages, 4 figures
☆ AIR-HLoc: Adaptive Image Retrieval for Efficient Visual Localisation
State-of-the-art (SOTA) hierarchical localisation pipelines (HLoc) rely on
image retrieval (IR) techniques to establish 2D-3D correspondences by selecting
the $k$ most similar images from a reference image database for a given query
image. Although higher values of $k$ enhance localisation robustness, the
computational cost for feature matching increases linearly with $k$. In this
paper, we observe that queries that are the most similar to images in the
database result in a higher proportion of feature matches and, thus, more
accurate positioning. Thus, a small number of images is sufficient for queries
very similar to images in the reference database. We then propose a novel
approach, AIR-HLoc, which divides query images into different localisation
difficulty levels based on their similarity to the reference image database. We
consider an image with high similarity to the reference image as an easy query
and an image with low similarity as a hard query. Easy queries show a limited
improvement in accuracy when increasing $k$. Conversely, higher values of $k$
significantly improve accuracy for hard queries. Given the limited improvement
in accuracy when increasing $k$ for easy queries and the significant
improvement for hard queries, we adapt the value of $k$ to the query's
difficulty level. Therefore, AIR-HLoc optimizes processing time by adaptively
assigning different values of $k$ based on the similarity between the query and
reference images without losing accuracy. Our extensive experiments on the
Cambridge Landmarks, 7Scenes, and Aachen Day-Night-v1.1 datasets demonstrate
our algorithm's efficacy, reducing computational overhead by 30\%, 26\%, and
11\%, respectively, while maintaining SOTA accuracy compared to HLoc with fixed
image retrieval.
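The adaptive choice of $k$ described above can be sketched as a simple
threshold lookup. This is an illustration only, not the authors' code: the
abstract gives neither the similarity measure, the number of difficulty levels,
nor the values of $k$, so the function name, the thresholds, and the $k$ values
below are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical settings; the abstract does not state any of these numbers.
SIMILARITY_THRESHOLDS = (0.6, 0.8)  # ascending cut-offs between difficulty levels
K_PER_LEVEL = (10, 5, 2)            # k for hard, medium, and easy queries


def adaptive_k(top_similarity, thresholds=SIMILARITY_THRESHOLDS, k_values=K_PER_LEVEL):
    """Pick how many reference images to retrieve for one query.

    `top_similarity` is assumed to be the similarity between the query and its
    best-matching reference image (for example, a cosine similarity of global
    descriptors). Higher similarity means an easier query and a smaller k.
    """
    if len(k_values) != len(thresholds) + 1:
        raise ValueError("need exactly one k value per difficulty level")
    # bisect_right counts how many thresholds the similarity reaches or exceeds.
    level = bisect_right(thresholds, top_similarity)
    return k_values[level]


if __name__ == "__main__":
    for similarity in (0.3, 0.7, 0.9):
        print(similarity, adaptive_k(similarity))
```

Because feature-matching cost grows linearly with $k$, the saving comes from
the share of queries that fall into the easy levels.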
☆ DVLO: Deep Visual-LiDAR Odometry with Local-to-Global Feature Fusion and Bi-Directional Structure Alignment
Visual and LiDAR data are highly complementary: images offer fine-grained
texture, while point clouds offer massive geometric information. However, it
remains challenging to explore effective visual-LiDAR
fusion, mainly due to the intrinsic data structure inconsistency between two
modalities: Images are regular and dense, but LiDAR points are unordered and
sparse. To address the problem, we propose a local-to-global fusion network
with bi-directional structure alignment. To obtain locally fused features, we
project points onto the image plane as cluster centers and cluster image pixels
around each center. Image pixels are pre-organized as pseudo points for
image-to-point structure alignment. Then, we convert points to pseudo images by
cylindrical projection (point-to-image structure alignment) and perform
adaptive global feature fusion between point features and locally fused
features. Our method achieves state-of-the-art performance on KITTI odometry
and FlyingThings3D scene flow datasets compared to both single-modal and
multi-modal methods. Codes will be released later.
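The point-to-image alignment step mentioned above relies on cylindrical
projection. The sketch below shows one common way to map LiDAR points to a
range-image grid. It is not taken from the paper: the image size, the vertical
field of view, and the function names are assumptions chosen for illustration.

```python
import math


def cylindrical_pixel(x, y, z, height=64, width=1800,
                      fov_up_deg=3.0, fov_down_deg=-25.0):
    """Map one 3D point to (row, col, range) on a cylindrical pseudo image.

    The image size and vertical field of view are illustrative assumptions.
    Returns None for a point at the sensor origin.
    """
    rng = math.sqrt(x * x + y * y + z * z)
    if rng == 0.0:
        return None
    azimuth = math.atan2(y, x)                            # horizontal angle
    elevation = math.asin(max(-1.0, min(1.0, z / rng)))   # vertical angle
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    col = 0.5 * (1.0 - azimuth / math.pi) * width
    row = (fov_up - elevation) / (fov_up - fov_down) * height
    # Clamp to the image so points outside the field of view land on the border.
    col = min(width - 1, max(0, int(math.floor(col))))
    row = min(height - 1, max(0, int(math.floor(row))))
    return row, col, rng


def to_pseudo_image(points, height=64, width=1800):
    """Build a range image, keeping the nearest point that falls in each pixel."""
    image = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        pixel = cylindrical_pixel(x, y, z, height, width)
        if pixel is None:
            continue
        row, col, rng = pixel
        if image[row][col] == 0.0 or rng < image[row][col]:
            image[row][col] = rng
    return image
```

The resulting grid is regular and dense like an image, so image-style
operators can be applied to point features stored per pixel.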
☆ Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding CVPR 2024
The Segment Anything Model (SAM) has garnered significant attention for its
versatile segmentation abilities and intuitive prompt-based interface. However,
its application in medical imaging presents challenges, requiring either
substantial training costs and extensive medical datasets for full model
fine-tuning or high-quality prompts for optimal performance. This paper
introduces H-SAM: a prompt-free adaptation of SAM tailored for efficient
fine-tuning of medical images via a two-stage hierarchical decoding procedure.
In the initial stage, H-SAM employs SAM's original decoder to generate a prior
probabilistic mask, guiding a more intricate decoding process in the second
stage. Specifically, we propose two key designs: 1) A class-balanced,
mask-guided self-attention mechanism addressing the unbalanced label
distribution, enhancing image embedding; 2) A learnable mask cross-attention
mechanism spatially modulating the interplay among different image regions
based on the prior mask. Moreover, the inclusion of a hierarchical pixel
decoder in H-SAM enhances its proficiency in capturing fine-grained and
localized details. This approach enables SAM to effectively integrate learned
medical priors, facilitating enhanced adaptation for medical image segmentation
with limited samples. Our H-SAM demonstrates a 4.78% improvement in average
Dice compared to existing prompt-free SAM variants for multi-organ segmentation
using only 10% of 2D slices. Notably, without using any unlabeled data, H-SAM
even outperforms state-of-the-art semi-supervised models relying on extensive
unlabeled training data across various medical datasets. Our code is available
at https://github.com/Cccccczh404/H-SAM.
comment: CVPR 2024
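For reference, the Dice score used as the headline metric above measures the
overlap between a predicted and a ground-truth mask. The sketch below is a
generic implementation, not the authors' evaluation code. The function names
and the convention of returning 1.0 when both masks are empty are assumptions.

```python
def dice_score(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


def mean_dice(pred_labels, target_labels, classes):
    """Average the per-class Dice over the given class ids (e.g. organs)."""
    scores = [
        dice_score([p == c for p in pred_labels], [t == c for t in target_labels])
        for c in classes
    ]
    return sum(scores) / len(scores)
```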
☆ Image Deraining via Self-supervised Reinforcement Learning
The quality of images captured outdoors is often affected by the weather. One
factor that interferes with sight is rain, which can obstruct the view of
observers and computer vision applications that rely on those images. The work
aims to recover rain images by removing rain streaks via Self-supervised
Reinforcement Learning (RL) for image deraining (SRL-Derain). We locate rain
streak pixels from the input rain image via dictionary learning and use
pixel-wise RL agents to take multiple inpainting actions to remove rain
progressively. To our knowledge, this work is the first attempt where
self-supervised RL is applied to image deraining. Experimental results on
several benchmark image-deraining datasets show that the proposed SRL-Derain
performs favorably against state-of-the-art few-shot and self-supervised
deraining and denoising methods.
☆ Branch-Tuning: Balancing Stability and Plasticity for Continual Self-Supervised Learning
Self-supervised learning (SSL) has emerged as an effective paradigm for
deriving general representations from vast amounts of unlabeled data. However,
as real-world applications continually integrate new content, the high
computational and resource demands of SSL necessitate continual learning rather
than complete retraining. This poses a challenge in striking a balance between
stability and plasticity when adapting to new information. In this paper, we
employ Centered Kernel Alignment for quantitatively analyzing model stability
and plasticity, revealing the critical roles of batch normalization layers for
stability and convolutional layers for plasticity. Motivated by this, we
propose Branch-tuning, an efficient and straightforward method that achieves a
balance between stability and plasticity in continual SSL. Branch-tuning
consists of branch expansion and compression, and can be easily applied to
various SSL methods without modifying the original methods or
retaining old data or models. We validate our method through incremental
experiments on various benchmark datasets, demonstrating its effectiveness and
practical value in real-world scenarios. We hope our work offers new insights
for future continual self-supervised learning research. The code will be made
publicly available.
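Centered Kernel Alignment, which the abstract above uses to quantify stability
and plasticity, compares two sets of representations computed on the same
inputs. The sketch below implements the linear variant with plain Python
lists. Whether the authors use the linear or a kernel variant is not stated, so
treat this as an assumption, and the function names are illustrative.

```python
import math


def _centered_gram(features):
    """Gram matrix of row vectors after subtracting the per-column mean."""
    n = len(features)
    d = len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in features]
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in centered]
            for ri in centered]


def _frobenius_inner(a, b):
    """Sum of element-wise products of two equally sized matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def linear_cka(feats_a, feats_b):
    """Linear CKA between two feature matrices with one row per example.

    Returns a value in [0, 1]; 1 means the representations match up to an
    orthogonal transform and isotropic scaling.
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("both feature sets must cover the same examples")
    k = _centered_gram(feats_a)
    l = _centered_gram(feats_b)
    denom = math.sqrt(_frobenius_inner(k, k) * _frobenius_inner(l, l))
    if denom == 0.0:
        raise ValueError("features are constant across examples")
    return _frobenius_inner(k, l) / denom
```

In a continual-learning analysis, one would compare the activations of the
same layer before and after training on new data: a value near 1 suggests the
layer stayed stable, while a lower value suggests it changed.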
☆ Toward Interactive Regional Understanding in Vision-Large Language Models NAACL 2024
Recent Vision-Language Pre-training (VLP) models have demonstrated
significant advancements. Nevertheless, these models heavily rely on image-text
pairs that capture only coarse and global information of an image, leading to a
limitation in their regional understanding ability. In this work, we introduce
\textbf{RegionVLM}, equipped with explicit regional modeling capabilities,
allowing it to understand user-indicated image regions. To achieve this, we
design a simple yet innovative architecture, requiring no modifications to the
model architecture or objective function. Additionally, we leverage a dataset
that contains a novel source of information, namely Localized Narratives, which
has been overlooked in previous VLP research. Our experiments demonstrate that
our single generalist model not only achieves an interactive dialogue system
but also exhibits superior performance on various zero-shot region
understanding tasks, without compromising its ability for global image
understanding.
comment: NAACL 2024 Main Conference
♻ ☆ Shifting to Machine Supervision: Annotation-Efficient Semi and Self-Supervised Learning for Automatic Medical Image Segmentation and Classification
Pranav Singh, Raviteja Chukkapalli, Shravan Chaudhari, Luoyao Chen, Mei Chen, Jinqian Pan, Craig Smuda, Jacopo Cirrone
Advancements in clinical treatment are increasingly constrained by the
limitations of supervised learning techniques, which depend heavily on large
volumes of annotated data. The annotation process is not only costly but also
demands substantial time from clinical specialists. Addressing this issue, we
introduce the S4MI (Self-Supervision and Semi-Supervision for Medical Imaging)
pipeline, a novel approach that leverages advancements in self-supervised and
semi-supervised learning. These techniques engage in auxiliary tasks that do
not require labeling, thus simplifying the scaling of machine supervision
compared to fully-supervised methods. Our study benchmarks these techniques on
three distinct medical imaging datasets to evaluate their effectiveness in
classification and segmentation tasks. Notably, we observed that
self-supervised learning significantly surpassed the performance of supervised
methods in the classification of all evaluated datasets. Remarkably, the
semi-supervised approach demonstrated superior outcomes in segmentation,
outperforming fully-supervised methods while using 50% fewer labels across all
datasets. In line with our commitment to contributing to the scientific
community, we have made the S4MI code openly accessible, allowing for broader
application and further development of these methods.
comment: Seventeen pages (incl. references), five figures, and one table.
(Under Review)
♻ ☆ Boosting Object Detection with Zero-Shot Day-Night Domain Adaptation CVPR 2024
Detecting objects in low-light scenarios presents a persistent challenge, as
detectors trained on well-lit data exhibit significant performance degradation
on low-light data due to low visibility. Previous methods mitigate this issue
by exploring image enhancement or object detection techniques with real
low-light image datasets. However, the progress is impeded by the inherent
difficulties of collecting and annotating low-light images. To address this
challenge, we propose to boost low-light object detection with zero-shot
day-night domain adaptation, which aims to generalize a detector from well-lit
scenarios to low-light ones without requiring real low-light data. Revisiting
Retinex theory in low-level vision, we first design a reflectance
representation learning module to learn Retinex-based illumination invariance
in images with a carefully designed illumination invariance reinforcement
strategy. Next, an interchange-redecomposition-coherence procedure is
introduced to improve over the vanilla Retinex image decomposition process by
performing two sequential image decompositions and introducing a
redecomposition cohering loss. Extensive experiments on ExDark, DARK FACE, and
CODaN datasets show strong low-light generalizability of our method. Our code
is available at https://github.com/ZPDu/DAI-Net.
comment: Accepted to CVPR 2024
♻ ☆ Decoupled Data Consistency with Diffusion Purification for Image Restoration
Diffusion models have recently gained traction as a powerful class of deep
generative priors, excelling in a wide range of image restoration tasks due to
their exceptional ability to model data distributions. To solve image
restoration problems, many existing techniques achieve data consistency by
incorporating additional likelihood gradient steps into the reverse sampling
process of diffusion models. However, the additional gradient steps pose a
challenge for real-world practical applications as they incur a large
computational overhead, thereby increasing inference time. They also present
additional difficulties when using accelerated diffusion model samplers, as the
number of data consistency steps is limited by the number of reverse sampling
steps. In this work, we propose a novel diffusion-based image restoration
solver that addresses these issues by decoupling the reverse process from the
data consistency steps. Our method involves alternating between a
reconstruction phase to maintain data consistency and a refinement phase that
enforces the prior via diffusion purification. Our approach demonstrates
versatility, making it highly adaptable for efficient problem-solving in latent
space. Additionally, it reduces the necessity for numerous sampling steps
through the integration of consistency models. The efficacy of our approach is
validated through comprehensive experiments across various image restoration
tasks, including image denoising, deblurring, inpainting, and super-resolution.
♻ ☆ Interpretable machine learning for time-to-event prediction in medicine and healthcare
Time-to-event prediction, e.g. cancer survival analysis or hospital length of
stay, is a highly prominent machine learning task in medical and healthcare
applications. However, only a few interpretable machine learning methods comply
with its challenges. To facilitate a comprehensive explanatory analysis of
survival models, we formally introduce time-dependent feature effects and
global feature importance explanations. We show how post-hoc interpretation
methods allow for finding biases in AI systems predicting length of stay using
a novel multi-modal dataset created from 1235 X-ray images with textual
radiology reports annotated by human experts. Moreover, we evaluate cancer
survival models beyond predictive performance to include the importance of
multi-omics feature groups based on a large-scale benchmark comprising 11
datasets from The Cancer Genome Atlas (TCGA). Model developers can use the
proposed methods to debug and improve machine learning algorithms, while
physicians can discover disease biomarkers and assess their significance. We
hope the contributed open data and code resources facilitate future work in the
emerging research direction of explainable survival analysis.
comment: An extended version of an AIME 2023 paper submitted to Artificial
Intelligence in Medicine
♻ ☆ Simplified Diffusion Schrödinger Bridge
This paper introduces a novel theoretical simplification of the Diffusion
Schrödinger Bridge (DSB) that facilitates its unification with Score-based
Generative Models (SGMs), addressing the limitations of DSB in complex data
generation and enabling faster convergence and enhanced performance. By
employing SGMs as an initial solution for DSB, our approach capitalizes on the
strengths of both frameworks, ensuring a more efficient training process and
improving the performance of SGM. We also propose a reparameterization
technique that, despite theoretical approximations, practically improves the
network's fitting capabilities. Our extensive experimental evaluations confirm
the effectiveness of the simplified DSB, demonstrating its significant
improvements. We believe the contributions of this work pave the way for
advanced generative modeling. The code is available at
https://github.com/checkcrab/SDSB.
♻ ☆ Self-supervised co-salient object detection via feature correspondence at multiple scales
Our paper introduces a novel two-stage self-supervised approach for detecting
co-occurring salient objects (CoSOD) in image groups without requiring
segmentation annotations. Unlike existing unsupervised methods that rely solely
on patch-level information (e.g. clustering patch descriptors) or on
computation-heavy off-the-shelf components for CoSOD, our lightweight model
leverages feature correspondences at both patch and region levels,
significantly improving prediction performance. In the first stage, we train a
self-supervised network that detects co-salient regions by computing local
patch-level feature correspondences across images. We obtain the segmentation
predictions using confidence-based adaptive thresholding. In the next stage, we
refine these intermediate segmentations by eliminating the detected regions
(within each image) whose averaged feature representations are dissimilar to
the foreground feature representation averaged across all the cross-attention
maps (from the previous stage). Extensive experiments on three CoSOD benchmark
datasets show that our self-supervised model outperforms the corresponding
state-of-the-art models by a huge margin (e.g. on the CoCA dataset, our model
has a 13.7% F-measure gain over the SOTA unsupervised CoSOD model). Notably,
our self-supervised model also outperforms several recent fully supervised
CoSOD models on the three test datasets (e.g., on the CoCA dataset, our model
has a 4.6% F-measure gain over a recent supervised CoSOD model).
♻ ☆ LION: Implicit Vision Prompt Tuning AAAI2024
Despite recent competitive performance across a range of vision tasks, vision
Transformers still have an issue of heavy computational costs. Recently, vision
prompt learning has provided an economic solution to this problem without
fine-tuning the whole large-scale models. However, the efficiency of existing
models is still far from satisfactory due to the insertion of extensive prompt
blocks and tricky prompt designs. In this paper, we propose an efficient vision
model named impLicit vIsion prOmpt tuNing (LION), which is motivated by deep
implicit models with stable memory costs for various complex tasks. In
particular, we merely insert two equilibrium implicit layers at the two ends of
the pre-trained main backbone with parameters in the backbone frozen. Moreover,
we prune the parameters in these two layers according to the lottery ticket
hypothesis. The performance obtained by our LION is promising on a wide range
of datasets. In
particular, our LION reduces up to 11.5% of training parameter numbers while
obtaining higher performance compared with the state-of-the-art baseline VPT,
especially under challenging scenes. Furthermore, we find that our proposed
LION has good generalization performance, making it an easy way to boost
transfer learning in the future.
comment: Accepted by AAAI2024; 9 pages, 3 figures, 4 tables
♻ ☆ Incorporating simulated spatial context information improves the effectiveness of contrastive learning models
Visual learning often occurs in a specific context, where an agent acquires
skills through exploration and tracking of its location in a consistent
environment. The historical spatial context of the agent provides a similarity
signal for self-supervised contrastive learning. We present a unique approach,
termed Environmental Spatial Similarity (ESS), that complements existing
contrastive learning methods. Using images from simulated, photorealistic
environments as an experimental setting, we demonstrate that ESS outperforms
traditional instance discrimination approaches. Moreover, sampling additional
data from the same environment substantially improves accuracy and provides new
augmentations. ESS allows remarkable proficiency in room classification and
spatial prediction tasks, especially in unfamiliar environments. This learning
paradigm has the potential to enable rapid visual learning in agents operating
in new environments with unique visual characteristics. Potentially
transformative applications span from robotics to space exploration. Our proof
of concept demonstrates improved efficiency over methods that rely on
extensive, disconnected datasets.
♻ ☆ Adaptive Negative Evidential Deep Learning for Open-set Semi-supervised Learning AAAI2024
Semi-supervised learning (SSL) methods assume that labeled data, unlabeled
data and test data are from the same distribution. Open-set semi-supervised
learning (Open-set SSL) considers a more practical scenario, where unlabeled
data and test data contain new categories (outliers) not observed in labeled
data (inliers). Most previous works focused on outlier detection via binary
classifiers, which suffer from insufficient scalability and inability to
distinguish different types of uncertainty. In this paper, we propose a novel
framework, Adaptive Negative Evidential Deep Learning (ANEDL) to tackle these
limitations. Concretely, we first introduce evidential deep learning (EDL) as
an outlier detector to quantify different types of uncertainty, and design
different uncertainty metrics for self-training and inference. Furthermore, we
propose a novel adaptive negative optimization strategy, making EDL more
tailored to the unlabeled dataset containing both inliers and outliers. As
demonstrated empirically, our proposed method outperforms existing
state-of-the-art methods across four datasets.
comment: Accepted by AAAI2024
♻ ☆ Vision Transformer-Based Deep Learning for Histologic Classification of Endometrial Cancer
Manu Goyal, Laura J. Tafe, James X. Feng, Kristen E. Muller, Liesbeth Hondelink, Jessica L. Bentz, Saeed Hassanpour
Endometrial cancer is the fourth most common cancer in females in the United
States, with a lifetime risk of developing the disease of approximately
2.8% in women. Precise histologic evaluation and molecular classification of
endometrial cancer are important for effective patient management and
determining the best treatment modalities. This study introduces EndoNet, which
uses convolutional neural networks for extracting histologic features and a
vision transformer for aggregating these features and classifying slides based
on their visual characteristics into high- and low-grade. The model was
trained on 929 digitized hematoxylin and eosin-stained whole-slide images of
endometrial cancer from hysterectomy cases at Dartmouth-Health. It classifies
these slides into low-grade (endometrioid grades 1 and 2) and high-grade
(endometrioid carcinoma FIGO grade 3, uterine serous carcinoma, carcinosarcoma)
categories. EndoNet was evaluated on an internal test set of 110 patients and
an external test set of 100 patients from the public TCGA database. The model
achieved a weighted average F1-score of 0.91 (95% CI: 0.86-0.95) and an AUC of
0.95 (95% CI: 0.89-0.99) on the internal test, and 0.86 (95% CI: 0.80-0.94) for
F1-score and 0.86 (95% CI: 0.75-0.93) for AUC on the external test. Pending
further validation, EndoNet has the potential to support pathologists without
the need for manual annotations in classifying the grades of gynecologic
pathology tumors.
comment: 4 Tables and 3 Figures
♻ ☆ Automated Construction of Time-Space Diagrams for Traffic Analysis Using Street-View Video Sequence SC
Time-space diagrams are essential tools for analyzing traffic patterns and
optimizing transportation infrastructure and traffic management strategies.
Traditional data collection methods for these diagrams have limitations in
terms of temporal and spatial coverage. Recent advancements in camera
technology have overcome these limitations and provided extensive urban data.
In this study, we propose an innovative approach to constructing time-space
diagrams by utilizing street-view video sequences captured by cameras mounted
on moving vehicles. Using the state-of-the-art YOLOv5, StrongSORT, and
photogrammetry techniques for distance calculation, we can infer vehicle
trajectories from the video data and generate time-space diagrams. To evaluate
the effectiveness of our proposed method, we utilized datasets from the KITTI
computer vision benchmark suite. The evaluation results demonstrate that our
approach can generate trajectories from video data, although there are some
errors that can be mitigated by improving the performance of the detector,
tracker, and distance calculation components. In conclusion, the utilization of
street-view video sequences captured by cameras mounted on moving vehicles,
combined with state-of-the-art computer vision techniques, has immense
potential for constructing comprehensive time-space diagrams. These diagrams
offer valuable insights into traffic patterns and contribute to the design of
transportation infrastructure and traffic management strategies.
comment: The paper is published in 2023 IEEE 26th International Conference on
Intelligent Transportation Systems (ITSC)
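As a rough illustration of the last step described above, the sketch below
turns per-frame tracker output into the (time, position) trajectories that a
time-space diagram plots. It is not the authors' pipeline: the input layout,
the frame rate, and the assumption that the camera vehicle's own position along
the road is known per frame are all hypothetical.

```python
from collections import defaultdict


def build_trajectories(observations, ego_positions, fps=10.0):
    """Group tracked detections into per-vehicle (time_s, position_m) series.

    observations: iterable of (frame_index, track_id, distance_m), where
        distance_m is the estimated distance from the camera to the vehicle.
    ego_positions: mapping frame_index -> position of the camera vehicle
        along the road in metres (assumed to be available, e.g. from GPS).
    """
    trajectories = defaultdict(list)
    for frame_index, track_id, distance_m in observations:
        time_s = frame_index / fps
        position_m = ego_positions[frame_index] + distance_m
        trajectories[track_id].append((time_s, position_m))
    for points in trajectories.values():
        points.sort()  # order each trajectory by time
    return dict(trajectories)
```

Plotting each series with time on the horizontal axis and position on the
vertical axis gives one line per tracked vehicle.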
♻ ☆ SOAC: Spatio-Temporal Overlap-Aware Multi-Sensor Calibration using Neural Radiance Fields CVPR 2024
Quentin Herau, Nathan Piasco, Moussab Bennehar, Luis Roldão, Dzmitry Tsishkou, Cyrille Migniot, Pascal Vasseur, Cédric Demonceaux
In rapidly-evolving domains such as autonomous driving, the use of multiple
sensors with different modalities is crucial to ensure high operational
precision and stability. To correctly exploit the information provided by each
sensor in a single common frame, it is essential for these sensors to be
accurately calibrated. In this paper, we leverage the ability of Neural
Radiance Fields (NeRF) to represent different sensor modalities in a common
volumetric representation to achieve robust and accurate spatio-temporal sensor
calibration. By designing a partitioning approach based on the visible part of
the scene for each sensor, we formulate the calibration problem using only the
overlapping areas. This strategy results in a more robust and accurate
calibration that is less prone to failure. We demonstrate that our approach
works on outdoor urban scenes by validating it on multiple established driving
datasets. Results show that our method is able to get better accuracy and
robustness compared to existing methods.
comment: Accepted at CVPR 2024. Project page: https://qherau.github.io/SOAC/
♻ ☆ Point, Segment and Count: A Generalized Framework for Object Counting CVPR 2024
Class-agnostic object counting aims to count all objects in an image with
respect to example boxes or class names, \emph{a.k.a.} few-shot and zero-shot
counting. In this paper, we propose a generalized framework for both few-shot
and zero-shot object counting based on detection. Our framework combines the
superior advantages of two foundation models without compromising their
zero-shot capability: (\textbf{i}) SAM to segment all possible objects as mask
proposals, and (\textbf{ii}) CLIP to classify proposals to obtain accurate
object counts. However, this strategy meets the obstacles of efficiency
overhead and the small crowded objects that cannot be localized and
distinguished. To address these issues, our framework, termed PseCo, follows
three steps: point, segment, and count. Specifically, we first propose a
class-agnostic object localization to provide accurate but minimal point prompts
for SAM, which consequently not only reduces computation costs but also avoids
missing small objects. Furthermore, we propose a generalized object
classification that leverages CLIP image/text embeddings as the classifier,
following a hierarchical knowledge distillation to obtain discriminative
classifications among hierarchical mask proposals. Extensive experimental
results on FSC-147, COCO, and LVIS demonstrate that PseCo achieves
state-of-the-art performance in both few-shot/zero-shot object
counting/detection. Code: https://github.com/Hzzone/PseCo
comment: Accepted by CVPR 2024. Camera ready
♻ ☆ Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation CVPR 2024
Xingqun Qi, Jiahao Pan, Peng Li, Ruibin Yuan, Xiaowei Chi, Mengfei Li, Wenhan Luo, Wei Xue, Shanghang Zhang, Qifeng Liu, Yike Guo
Generating vivid and emotional 3D co-speech gestures is crucial for virtual
avatar animation in human-machine interaction applications. While the existing
methods enable generating the gestures to follow a single emotion label, they
overlook that long gesture sequence modeling with emotion transition is more
practical in real scenes. In addition, the lack of large-scale available
datasets with emotional transition speech and corresponding 3D human gestures
also limits the addressing of this task. To fulfill this goal, we first
incorporate ChatGPT-4 and an audio inpainting approach to construct
high-fidelity emotion transition human speech. Considering that obtaining
realistic 3D pose annotations corresponding to the dynamically inpainted
emotion transition audio is extremely difficult, we propose a novel weakly
supervised training strategy to encourage authority gesture transitions.
Specifically, to enhance the coordination of transition gestures w.r.t.
different emotional ones, we model the temporal association representation
between two different emotional gesture sequences as style guidance and infuse
it into the transition generation. We further devise an emotion mixture
mechanism that provides weak supervision based on a learnable mixed emotion
label for transition gestures. Last, we present a keyframe sampler to supply
effective initial posture cues in long sequences, enabling us to generate
diverse gestures. Extensive experiments demonstrate that our method outperforms
the state-of-the-art models constructed by adapting single emotion-conditioned
counterparts on our newly defined emotion transition task and datasets. Our
code and dataset will be released on the project page:
https://xingqunqi-lab.github.io/Emo-Transition-Gesture/.
comment: Accepted by CVPR 2024
♻ ☆ Learning by Erasing: Conditional Entropy based Transferable Out-Of-Distribution Detection
Out-of-distribution (OOD) detection is essential to handle the distribution
shifts between training and test scenarios. For a new in-distribution (ID)
dataset, existing methods require retraining to capture the dataset-specific
feature representation or data distribution. In this paper, we propose a deep
generative model (DGM) based transferable OOD detection method that does not
need to be retrained on a new ID dataset. We design an image erasing strategy
to equip each ID dataset with an exclusive conditional entropy distribution,
which determines the discrepancy of DGM's posteriori uncertainty distribution
on different ID datasets. Owing to the powerful representation capacity of
convolutional neural networks, the proposed model trained on a complex dataset
can capture the above discrepancy between ID datasets without retraining and
thus achieve transferable OOD detection. We validate the proposed method on
five datasets and verify that ours achieves comparable performance to the
state-of-the-art group based OOD detection methods that need to be retrained to
deploy on new ID datasets. Our code is available at
https://github.com/oOHCIOo/CETOOD.
comment: update new experimental results
♻ ☆ Dual Structure-Aware Image Filterings for Semi-supervised Medical Image Segmentation
Semi-supervised image segmentation has attracted great attention recently.
The key is how to leverage unlabeled images in the training process. Most
methods maintain consistent predictions of the unlabeled images under
variations (e.g., adding noise/perturbations, or creating alternative versions)
at the image and/or model level. In most image-level variations, medical images
often have prior structure information, which has not been well explored. In
this paper, we propose novel dual structure-aware image filterings (DSAIF) as
the image-level variations for semi-supervised medical image segmentation.
Motivated by connected filtering that simplifies an image via filtering in a
structure-aware tree-based image representation, we resort to the dual contrast
invariant Max-tree and Min-tree representation. Specifically, we propose a
novel connected filtering that removes topologically equivalent nodes (i.e.
connected components) having no siblings in the Max/Min-tree. This results in
two filtered images preserving topologically critical structure. Applying the
proposed DSAIF to mutually supervised networks decreases the consensus of their
erroneous predictions on unlabeled images. This helps to alleviate the
confirmation bias issue of overfitting to noisy pseudo labels of unlabeled
images, and thus effectively improves the segmentation performance. Extensive
experimental results on three benchmark datasets demonstrate that the proposed
method significantly/consistently outperforms some state-of-the-art methods.
The source codes will be publicly available.
♻ ☆ Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework CVPR2024
Vu Minh Hieu Phan, Yutong Xie, Yuankai Qi, Lingqiao Liu, Liyang Liu, Bowen Zhang, Zhibin Liao, Qi Wu, Minh-Son To, Johan W. Verjans
Medical vision language pre-training (VLP) has emerged as a frontier of
research, enabling zero-shot pathological recognition by comparing the query
image with the textual descriptions for each disease. Due to the complex
semantics of biomedical texts, current methods struggle to align medical images
with key pathological findings in unstructured reports. This leads to the
misalignment with the target disease's textual representation. In this paper,
we introduce a novel VLP framework designed to dissect disease descriptions
into their fundamental aspects, leveraging prior knowledge about the visual
manifestations of pathologies. This is achieved by consulting a large language
model and medical experts. Integrating a Transformer module, our approach
aligns an input image with the diverse elements of a disease, generating
aspect-centric image representations. By consolidating the matches from each
aspect, we improve the compatibility between an image and its associated
disease. Additionally, capitalizing on the aspect-oriented representations, we
present a dual-head Transformer tailored to process known and unknown diseases,
optimizing the comprehensive detection efficacy. In experiments on seven
downstream datasets, our method improves the accuracy of recent methods by up
to 8.56% and 17.0% for seen and unseen categories, respectively. Our code is
released at https://github.com/HieuPhan33/MAVL.
comment: Accepted at CVPR2024. Pre-print before final camera-ready version
♻ ☆ Shapley Values-Powered Framework for Fair Reward Split in Content Produced by GenAI
It is evident that, currently, generative models are surpassed in quality by
human professionals. However, with the advancements in Artificial Intelligence,
this gap will narrow, leading to scenarios where individuals who have dedicated
years of their lives to mastering a skill become obsolete due to their high
costs, which are inherently linked to the time they require to complete a task
-- a task that AI could accomplish in minutes or seconds. To avoid future
social upheavals, we must, even now, contemplate how to fairly assess the
contributions of such individuals in training generative models and how to
compensate them for the reduction or complete loss of their incomes. In this
work, we propose a method to structure collaboration between model developers
and data providers. To achieve this, we employ Shapley Values to quantify the
contribution of artist(s) in an image generated by the Stable Diffusion-v1.5
model and to equitably allocate the reward among them.
comment: 36 pages, 32 figures
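The Shapley-value reward split described in the abstract above can be illustrated with a minimal sketch. It computes exact Shapley values by averaging each player's marginal contribution over all orderings. The two contributors "A" and "B" and the `UTILITY` table are hypothetical; the paper's actual utility function for Stable Diffusion outputs is not given here, and exact enumeration is only practical for a handful of players.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

# Hypothetical utility of each coalition of contributing artists (assumption).
UTILITY = {
    frozenset(): 0.0,
    frozenset({"A"}): 0.4,
    frozenset({"B"}): 0.2,
    frozenset({"A", "B"}): 1.0,
}

def toy_value(coalition):
    return UTILITY[frozenset(coalition)]

shares = shapley_values(["A", "B"], toy_value)
print(shares)
```

The shares always sum to the utility of the full coalition, which is what makes the split a complete allocation of the reward.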
♻ ☆ E4S: Fine-grained Face Swapping via Editing With Regional GAN Inversion
This paper proposes a novel approach to face swapping from the perspective of
fine-grained facial editing, dubbed "editing for swapping" (E4S). The
traditional face swapping methods rely on global feature extraction and fail to
preserve the detailed source identity. In contrast, we propose a Regional GAN
Inversion (RGI) method, which allows the explicit disentanglement of shape and
texture. Specifically, our E4S performs face swapping in the latent space of a
pretrained StyleGAN, where a multi-scale mask-guided encoder is applied to
project the texture of each facial component into regional style codes and a
mask-guided injection module manipulates feature maps with the style codes.
Based on this disentanglement, face swapping can be simplified as style and
mask swapping. Besides, due to the large lighting condition gap, transferring
the source skin into the target image may lead to disharmonious lighting. We
propose a re-coloring network to make the swapped face maintain the target
lighting condition while preserving the source skin. Further, to deal with the
potential mismatch areas during mask exchange, we design a face inpainting
module to refine the face shape. The extensive comparisons with
state-of-the-art methods demonstrate that our E4S outperforms existing methods
in preserving texture, shape, and lighting. Our implementation is available at
https://github.com/e4s2024/E4S2024.
comment: Project Page: https://e4s2024.github.io/. arXiv admin note: text
overlap with arXiv:2211.14068
♻ ☆ ViDA: Homeostatic Visual Domain Adapter for Continual Test Time Adaptation ICLR2024
Jiaming Liu, Senqiao Yang, Peidong Jia, Renrui Zhang, Ming Lu, Yandong Guo, Wei Xue, Shanghang Zhang
Since real-world machine systems run in non-stationary environments, the
Continual Test-Time Adaptation (CTTA) task has been proposed to adapt a
pre-trained model to continually changing target domains. Existing methods
mainly focus on model-based adaptation, which uses self-training to extract
target domain knowledge. However, pseudo labels can be noisy and
the updated model parameters are unreliable under dynamic data distributions,
leading to error accumulation and catastrophic forgetting in the continual
adaptation process. To tackle these challenges and maintain the model
plasticity, we design a Visual Domain Adapter (ViDA) for CTTA, explicitly
handling both domain-specific and domain-shared knowledge. Specifically, we
first comprehensively explore the different domain representations of the
adapters with trainable high-rank or low-rank embedding spaces. Then we inject
ViDAs into the pre-trained model, which leverages high-rank and low-rank
features to adapt the current domain distribution and maintain the continual
domain-shared knowledge, respectively. To exploit the low-rank and high-rank
ViDAs more effectively, we further propose a Homeostatic Knowledge Allotment
(HKA) strategy, which adaptively combines different knowledge from each ViDA.
Extensive experiments conducted on four widely used benchmarks demonstrate that
our proposed method achieves state-of-the-art performance in both
classification and segmentation CTTA tasks. Note that our method can be
regarded as a novel transfer paradigm for large-scale models, delivering
promising results in adaptation to continually changing distributions. Project
page: https://sites.google.com/view/iclr2024-vida/home.
comment: Accepted by ICLR2024
♻ ☆ Visually Guided Generative Text-Layout Pre-training for Document Intelligence NAACL 2024
Prior studies show that pre-training techniques can boost the performance of
visual document understanding (VDU), which typically requires models to
perceive and reason over both document texts and layouts (e.g.,
locations of texts and table-cells). To this end, we propose visually guided
generative text-layout pre-training, named ViTLP. Given a document image, the
model optimizes hierarchical language and layout modeling objectives to
generate the interleaved text and layout sequence. In addition, to address the
limitation of processing long documents by Transformers, we introduce a
straightforward yet effective multi-segment generative pre-training scheme,
enabling ViTLP to process word-intensive documents of any length. ViTLP can
function as a native OCR model to localize and recognize texts of document
images. Besides, ViTLP can be effectively applied to various downstream VDU
tasks. Extensive experiments show that ViTLP achieves competitive performance
over existing baselines on benchmark VDU tasks, including information
extraction, document classification, and document question answering.
comment: Accepted to NAACL 2024 main conference. The first version of this
paper was submitted to OpenReview
(https://openreview.net/forum?id=ARtBIBAmNR) in June 2023
♻ ☆ Intraoperative 2D/3D Image Registration via Differentiable X-ray Rendering CVPR 2024
Surgical decisions are informed by aligning rapid portable 2D intraoperative
images (e.g., X-rays) to a high-fidelity 3D preoperative reference scan (e.g.,
CT). 2D/3D image registration often fails in practice: conventional
optimization methods are prohibitively slow and susceptible to local minima,
while neural networks trained on small datasets fail on new patients or require
impractical landmark supervision. We present DiffPose, a self-supervised
approach that leverages patient-specific simulation and differentiable
physics-based rendering to achieve accurate 2D/3D registration without relying
on manually labeled data. Preoperatively, a CNN is trained to regress the pose
of a randomly oriented synthetic X-ray rendered from the preoperative CT. The
CNN then initializes rapid intraoperative test-time optimization that uses the
differentiable X-ray renderer to refine the solution. Our work further proposes
several geometrically principled methods for sampling camera poses from
$\mathbf{SE}(3)$, for sparse differentiable rendering, and for driving
registration in the tangent space $\mathfrak{se}(3)$ with geodesic and
multiscale locality-sensitive losses. DiffPose achieves sub-millimeter accuracy
across surgical datasets at intraoperative speeds, improving upon existing
unsupervised methods by an order of magnitude and even outperforming supervised
baselines. Our code is available at https://github.com/eigenvivek/DiffPose.
comment: CVPR 2024
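As a rough illustration of the geodesic pose losses mentioned in the DiffPose abstract above, the sketch below computes the geodesic angle between two rotation matrices, arccos((trace(Ra^T Rb) - 1) / 2), and combines it with a translation error. The combined `pose_distance` and its `weight` parameter are assumptions made for illustration; this is not the loss defined in the paper or its code.

```python
import math

def rot_z(theta):
    """Rotation by theta radians about the z-axis, as a 3x3 nested list."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def geodesic_angle(Ra, Rb):
    """Geodesic distance on SO(3): the rotation angle of Ra^T Rb, in radians."""
    # trace(Ra^T Rb) equals the element-wise (Frobenius) inner product.
    tr = sum(Ra[i][j] * Rb[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (tr - 1.0) / 2.0))  # clamp rounding error
    return math.acos(cos_angle)

def pose_distance(Ra, ta, Rb, tb, weight=1.0):
    """Hypothetical combined pose error: rotation angle plus weighted translation."""
    angle = geodesic_angle(Ra, Rb)
    trans = math.sqrt(sum((a - b) ** 2 for a, b in zip(ta, tb)))
    return math.sqrt(angle ** 2 + weight * trans ** 2)

angle = geodesic_angle(rot_z(0.3), rot_z(0.8))
dist = pose_distance(rot_z(0.0), [0.0, 0.0, 0.0], rot_z(0.0), [3.0, 4.0, 0.0])
print(angle, dist)
```

Unlike an element-wise matrix difference, the geodesic angle measures the actual rotation needed to align the two orientations.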
♻ ☆ Challenging Common Paradigms in Multi-Task Learning
While multi-task learning (MTL) has gained significant attention in recent
years, its underlying mechanisms remain poorly understood. Recent methods did
not yield consistent performance improvements over single task learning (STL)
baselines, underscoring the importance of gaining more profound insights about
challenges specific to MTL. In our study, we challenge paradigms in MTL in the
context of STL: First, the impact of the choice of optimizer has only been
mildly investigated in MTL. We show the pivotal role of common STL tools such
as the Adam optimizer in MTL empirically in various experiments. To further
investigate Adam's effectiveness, we theoretically derive a partial loss-scale
invariance under mild assumptions. Second, the notion of gradient conflicts has
often been phrased as a specific problem in MTL. We delve into the role of
gradient conflicts in MTL and compare it to STL. For angular gradient alignment
we find no evidence that this is a unique problem in MTL. We emphasize
differences in gradient magnitude as the main distinguishing factor. Lastly, we
compare the transferability of features learned through MTL and STL on common
image corruptions, and find weak evidence that MTL can lead to superior
transferability. Overall, we find surprising similarities between STL and MTL,
suggesting that methods from both fields be considered in a broader context.
comment: -
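The loss-scale invariance of Adam mentioned in the abstract above can be checked numerically on a toy problem. The sketch below runs a plain scalar Adam on (x - 3)^2 and on the same loss multiplied by 100, with eps set to 0 so that the scale cancels in the ratio of the moment estimates. The quadratic loss, the hyperparameters, and the eps = 0 setting are assumptions chosen for the demonstration; the paper's derivation and its assumptions are not reproduced here.

```python
import math

def adam_path(grad, x0, steps=50, lr=0.1, b1=0.9, b2=0.999, eps=0.0):
    """Run plain Adam on a scalar parameter and return the visited points."""
    x, m, v, path = x0, 0.0, 0.0, []
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)   # bias-corrected first moment
        v_hat = v / (1 - b2 ** t)   # bias-corrected second moment
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
        path.append(x)
    return path

def sgd_path(grad, x0, steps=5, lr=0.1):
    """Plain gradient descent, for contrast."""
    x, path = x0, []
    for _ in range(steps):
        x -= lr * grad(x)
        path.append(x)
    return path

def make_grad(scale):
    """Gradient of the toy loss scale * (x - 3)^2."""
    return lambda x: scale * 2.0 * (x - 3.0)

def max_gap(path_a, path_b):
    return max(abs(a - b) for a, b in zip(path_a, path_b))

adam_gap = max_gap(adam_path(make_grad(1.0), 0.0), adam_path(make_grad(100.0), 0.0))
sgd_gap = max_gap(sgd_path(make_grad(1.0), 0.0), sgd_path(make_grad(100.0), 0.0))
print(adam_gap, sgd_gap)
```

With a nonzero eps the Adam trajectories would agree only approximately, which is one reason the invariance is partial rather than exact.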
♻ ☆ Neural Fields for Interactive Visualization of Statistical Dependencies in 3D Simulation Ensembles
Fatemeh Farokhmanesh, Kevin Höhlein, Christoph Neuhauser, Tobias Necker, Martin Weissmann, Takemasa Miyoshi, Rüdiger Westermann
We present the first neural network that has learned to compactly represent
and can efficiently reconstruct the statistical dependencies between the values
of physical variables at different spatial locations in large 3D simulation
ensembles. Going beyond linear dependencies, we consider mutual information as
a measure of non-linear dependence. We demonstrate learning and reconstruction
with a large weather forecast ensemble comprising 1000 members, each storing
multiple physical variables at a 250 x 352 x 20 simulation grid. By
circumventing compute-intensive statistical estimators at runtime, we
demonstrate significantly reduced memory and computation requirements for
reconstructing the major dependence structures. This enables embedding the
estimator into a GPU-accelerated direct volume renderer and interactively
visualizing all mutual dependencies for a selected domain point.
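To make concrete why mutual information captures dependencies that linear correlation misses, the sketch below compares a histogram-based (plug-in) mutual information estimate with the Pearson coefficient on y = x^2. The estimator, the bin count, and the toy data are assumptions for illustration; the paper replaces such runtime estimators with a learned neural representation.

```python
import math
from collections import Counter

def mutual_information(xs, ys, bins=8):
    """Plug-in mutual information estimate (in nats) from equal-width histograms."""
    def digitize(vals):
        lo, hi = min(vals), max(vals)
        width = (hi - lo) / bins or 1.0  # constant input falls into one bin
        return [min(int((v - lo) / width), bins - 1) for v in vals]

    bx, by = digitize(xs), digitize(ys)
    n = len(xs)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum((c / n) * math.log(c * n / (px[i] * py[j]))
               for (i, j), c in joint.items())

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [i / 50 - 1 for i in range(101)]   # symmetric samples in [-1, 1]
ys = [x * x for x in xs]                # purely non-linear dependence
mi = mutual_information(xs, ys)
rho = pearson(xs, ys)
print(mi, rho)
```

On this data the correlation is essentially zero while the mutual information estimate is clearly positive.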
♻ ☆ SAR-Net: Multi-scale Direction-aware SAR Network via Global Information Fusion
Deep learning has driven significant progress in object detection using
Synthetic Aperture Radar (SAR) imagery. Existing methods, while achieving
promising results, often struggle to effectively integrate local and global
information, particularly direction-aware features. This paper proposes
SAR-Net, a novel framework specifically designed for global fusion of
direction-aware information in SAR object detection. SAR-Net leverages two key
innovations: the Unity Compensation Mechanism (UCM) and the Direction-aware
Attention Module (DAM). UCM facilitates the establishment of complementary
relationships among features across different scales, enabling efficient global
information fusion. Among them, Multi-scale Alignment Module (MAM) and distinct
Multi-level Fusion Module (MFM) enhance feature integration by capturing both
texture detail and semantic information. Then, Multi-feature Embedding Module
(MEM) feeds back global features into the primary branches, further improving
information transmission. Additionally, DAM, through bidirectional attention
polymerization, captures direction-aware information, effectively eliminating
background interference. Extensive experiments demonstrate the effectiveness of
SAR-Net, achieving state-of-the-art results on aircraft (SAR-AIRcraft-1.0) and
ship datasets (SSDD, HRSID), confirming its generalization capability and
robustness.
♻ ☆ Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation CVPR 2024
Transformers have been successfully applied in the field of video-based 3D
human pose estimation. However, the high computational costs of these video
pose transformers (VPTs) make them impractical on resource-constrained devices.
In this paper, we present a plug-and-play pruning-and-recovering framework,
called Hourglass Tokenizer (HoT), for efficient transformer-based 3D human pose
estimation from videos. Our HoT begins with pruning pose tokens of redundant
frames and ends with recovering full-length tokens, resulting in a few pose
tokens in the intermediate transformer blocks and thus improving the model
efficiency. To effectively achieve this, we propose a token pruning cluster
(TPC) that dynamically selects a few representative tokens with high semantic
diversity while eliminating the redundancy of video frames. In addition, we
develop a token recovering attention (TRA) to restore the detailed
spatio-temporal information based on the selected tokens, thereby expanding the
network output to the original full-length temporal resolution for fast
inference. Extensive experiments on two benchmark datasets (i.e., Human3.6M and
MPI-INF-3DHP) demonstrate that our method can achieve both high efficiency and
estimation accuracy compared to the original VPT models. For instance, applying
to MotionBERT and MixSTE on Human3.6M, our HoT can save nearly 50% FLOPs
without sacrificing accuracy and nearly 40% FLOPs with only 0.2% accuracy drop,
respectively. Code and models are available at
https://github.com/NationalGAILab/HoT.
comment: Accepted by CVPR 2024, Open Sourced
♻ ☆ Enhancing Object Coherence in Layout-to-Image Synthesis
Layout-to-image synthesis is an emerging technique in conditional image
generation. It aims to generate complex scenes, where users require fine
control over the layout of the objects in a scene. However, it remains
challenging to control the object coherence, including semantic coherence
(e.g., the cat looks at the flowers or not) and physical coherence (e.g., the
hand and the racket should not be misaligned). In this paper, we propose a
novel diffusion model with effective global semantic fusion (GSF) and
self-similarity feature enhancement modules to guide the object coherence for
this task. For semantic coherence, we argue that the image caption contains
rich information for defining the semantic relationship within the objects in
the images. Instead of simply employing cross-attention between captions and
generated images, which addresses the highly relevant layout restriction and
semantic coherence separately and thus leads to unsatisfying results shown in
our experiments, we develop GSF to fuse the supervision from the layout
restriction and semantic coherence requirement and exploit it to guide the
image synthesis process. Moreover, to improve the physical coherence, we
develop a Self-similarity Coherence Attention (SCA) module to explicitly
integrate local contextual physical coherence into each pixel's generation
process. Specifically, we adopt a self-similarity map to encode the coherence
restrictions and employ it to extract coherent features from text embedding.
Through visualization of our self-similarity map, we explore the essence of
SCA, revealing that its effectiveness is not only in capturing reliable
physical coherence patterns but also in enhancing complex texture generation.
Extensive experiments demonstrate the superiority of our proposed method in
both image generation quality and controllability.
♻ ☆ BEVUDA: Multi-geometric Space Alignments for Domain Adaptive BEV 3D Object Detection ICRA2024
Jiaming Liu, Rongyu Zhang, Xiaoqi Li, Xiaowei Chi, Zehui Chen, Ming Lu, Yandong Guo, Shanghang Zhang
Vision-centric bird-eye-view (BEV) perception has shown promising potential
in autonomous driving. Recent works mainly focus on improving efficiency or
accuracy but neglect the challenges posed by changing environments, resulting
in severe degradation of transfer performance. For BEV perception, we identify
the significant domain gaps existing in typical real-world cross-domain
scenarios and comprehensively address the Domain Adaptation (DA) problem for
multi-view 3D object detection. Since BEV perception approaches are complicated
and contain several components, the domain shift accumulation on multiple
geometric spaces (i.e., 2D, 3D Voxel, BEV) makes BEV DA even more challenging. In
this paper, we propose a Multi-space Alignment Teacher-Student (MATS) framework
to ease the domain shift accumulation, which consists of a Depth-Aware Teacher
(DAT) and a Geometric-space Aligned Student (GAS) model. DAT tactfully combines
target lidar and reliable depth prediction to construct depth-aware
information, extracting target domain-specific knowledge in Voxel and BEV
feature spaces. It then transfers the sufficient domain knowledge of multiple
spaces to the student model. In order to jointly alleviate the domain shift,
GAS projects multi-geometric space features to a shared geometric embedding
space and decreases data distribution distance between two domains. To verify
the effectiveness of our method, we conduct BEV 3D object detection experiments
on three cross-domain scenarios and achieve state-of-the-art performance.
comment: Accepted by ICRA2024
♻ ☆ Back to 3D: Few-Shot 3D Keypoint Detection with Back-Projected 2D Features CVPR 2024
With the immense growth of dataset sizes and computing resources in recent
years, so-called foundation models have become popular in NLP and vision tasks.
In this work, we propose to explore foundation models for the task of keypoint
detection on 3D shapes. A unique characteristic of keypoint detection is that
it requires semantic and geometric awareness while demanding high localization
accuracy. To address this problem, we propose, first, to back-project features
from large pre-trained 2D vision models onto 3D shapes and employ them for this
task. We show that we obtain robust 3D features that contain rich semantic
information and analyze multiple candidate features stemming from different 2D
foundation models. Second, we employ a keypoint candidate optimization module
which aims to match the average observed distribution of keypoints on the shape
and is guided by the back-projected features. The resulting approach achieves a
new state of the art for few-shot keypoint detection on the KeyPointNet
dataset, almost doubling the performance of the previous best methods.
comment: Accepted to CVPR 2024, Project page:
https://wimmerth.github.io/back-to-3d.html
♻ ☆ Fast Dynamic 3D Object Generation from a Single-view Video
Generating a dynamic 3D object from a single-view video is challenging due to
the lack of 4D labeled data. Extending image-to-3D pipelines by transferring
off-the-shelf image generation models such as score distillation sampling,
existing methods tend to be slow and expensive to scale due to the need for
back-propagating the information-limited supervision signals through a large
pretrained model. To address this, we propose an efficient video-to-4D object
generation framework called Efficient4D. It generates high-quality
spacetime-consistent images under different camera views, and then uses them as
labeled data to directly train a novel 4D Gaussian splatting model with
explicit point cloud geometry, enabling real-time rendering under continuous
camera trajectories. Extensive experiments on synthetic and real videos show
that Efficient4D offers a remarkable 20-fold increase in speed when compared to
prior art alternatives while preserving the quality of novel view synthesis.
For example, Efficient4D takes only 6 mins to model a dynamic object, vs 120
mins by Consistent4D.
comment: Technical report
♻ ☆ UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction
Lan Feng, Mohammadhossein Bahari, Kaouther Messaoud Ben Amor, Éloi Zablocki, Matthieu Cord, Alexandre Alahi
Vehicle trajectory prediction has increasingly relied on data-driven
solutions, but their ability to scale to different data domains and the impact
of larger dataset sizes on their generalization remain under-explored. While
these questions can be studied by employing multiple datasets, it is
challenging due to several discrepancies, e.g., in data formats, map
resolution, and semantic annotation types. To address these challenges, we
introduce UniTraj, a comprehensive framework that unifies various datasets,
models, and evaluation criteria, presenting new opportunities for the vehicle
trajectory prediction field. In particular, using UniTraj, we conduct extensive
experiments and find that model performance significantly drops when
transferred to other datasets. However, enlarging data size and diversity can
substantially improve performance, leading to a new state-of-the-art result for
the nuScenes dataset. We provide insights into dataset characteristics to
explain these findings. The code can be found here:
https://github.com/vita-epfl/UniTraj
♻ ☆ CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation
Monika Wysoczańska, Oriane Siméoni, Michaël Ramamonjisoa, Andrei Bursuc, Tomasz Trzciński, Patrick Pérez
The popular CLIP model displays impressive zero-shot capabilities thanks to
its seamless interaction with arbitrary text prompts. However, its lack of
spatial awareness makes it unsuitable for dense computer vision tasks, e.g.,
semantic segmentation, without an additional fine-tuning step that often uses
annotations and can potentially suppress its original open-vocabulary
properties. Meanwhile, self-supervised representation methods have demonstrated
good localization properties without human-made annotations or explicit
supervision. In this work, we take the best of both worlds and propose an
open-vocabulary semantic segmentation method, which does not require any
annotations. We propose to locally improve dense MaskCLIP features, which are
computed with a simple modification of CLIP's last pooling layer, by
integrating localization priors extracted from self-supervised features. By
doing so, we greatly improve the performance of MaskCLIP and produce smooth
outputs. Moreover, we show that the used self-supervised feature properties can
directly be learnt from CLIP features. Our method, CLIP-DINOiser, needs only a
single forward pass of CLIP and two light convolutional layers at inference,
requires no extra supervision or extra memory, and reaches state-of-the-art results on
challenging and fine-grained benchmarks such as COCO, Pascal Context,
Cityscapes and ADE20k. The code to reproduce our results is available at
https://github.com/wysoczanska/clip_dinoiser.
♻ ☆ Continual-MAE: Adaptive Distribution Masked Autoencoders for Continual Test-Time Adaptation CVPR2024
Jiaming Liu, Ran Xu, Senqiao Yang, Renrui Zhang, Qizhe Zhang, Zehui Chen, Yandong Guo, Shanghang Zhang
Continual Test-Time Adaptation (CTTA) is proposed to migrate a source
pre-trained model to continually changing target distributions, addressing
real-world dynamism. Existing CTTA methods mainly rely on entropy minimization
or teacher-student pseudo-labeling schemes for knowledge extraction in
unlabeled target domains. However, dynamic data distributions cause
miscalibrated predictions and noisy pseudo-labels in existing self-supervised
learning methods, hindering the effective mitigation of error accumulation and
catastrophic forgetting problems during the continual adaptation process. To
tackle these issues, we propose a continual self-supervised method, Adaptive
Distribution Masked Autoencoders (ADMA), which enhances the extraction of
target domain knowledge while mitigating the accumulation of distribution
shifts. Specifically, we propose a Distribution-aware Masking (DaM) mechanism
to adaptively sample masked positions, followed by establishing consistency
constraints between the masked target samples and the original target samples.
Additionally, for masked tokens, we utilize an efficient decoder to reconstruct
a hand-crafted feature descriptor (e.g., Histograms of Oriented Gradients),
leveraging its invariant properties to boost task-relevant representations.
In extensive experiments on four widely recognized benchmarks,
our proposed method attains state-of-the-art performance in both classification
and segmentation CTTA tasks. Our project page:
https://sites.google.com/view/continual-mae/home.
comment: Accepted by CVPR2024
♻ ☆ A2V: A Semi-Supervised Domain Adaptation Framework for Brain Vessel Segmentation via Two-Phase Training Angiography-to-Venography Translation BMVC
Francesco Galati, Daniele Falcetta, Rosa Cortese, Barbara Casolla, Ferran Prados, Ninon Burgos, Maria A. Zuluaga
We present a semi-supervised domain adaptation framework for brain vessel
segmentation from different image modalities. Existing state-of-the-art methods
focus on a single modality, despite the wide range of available cerebrovascular
imaging techniques. This can lead to significant distribution shifts that
negatively impact the generalization across modalities. By relying on annotated
angiographies and a limited number of annotated venographies, our framework
accomplishes image-to-image translation and semantic segmentation, leveraging a
disentangled and semantically rich latent space to represent heterogeneous data
and perform image-level adaptation from source to target domains. Moreover, we
reduce the typical complexity of cycle-based architectures and minimize the use
of adversarial training, which allows us to build an efficient and intuitive
model with stable training. We evaluate our method on magnetic resonance
angiographies and venographies. While achieving state-of-the-art performance in
the source domain, our method attains a Dice score coefficient in the target
domain that is only 8.9% lower, highlighting its promising potential for robust
cerebrovascular image segmentation across different modalities.
comment: Accepted at the 34th British Machine Vision Conference (BMVC)
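For reference, the Dice coefficient reported in the abstract above is 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a ground-truth mask B. The sketch below computes it on flattened binary masks; the toy masks and the convention of returning 1.0 for two empty masks are assumptions, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two flat binary masks (sequences of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * intersection / size_sum

pred_mask = [1, 1, 0, 0, 1]     # hypothetical predicted vessel voxels
target_mask = [1, 0, 0, 1, 1]   # hypothetical annotated vessel voxels
score = dice_score(pred_mask, target_mask)
print(score)
```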
♻ ☆ Debiasing Multimodal Large Language Models
In the realms of computer vision and natural language processing, Large
Vision-Language Models (LVLMs) have become indispensable tools, proficient in
generating textual descriptions based on visual inputs. Despite their
advancements, our investigation reveals a noteworthy bias in the generated
content, where the output is primarily influenced by the prior of the
underlying Large Language Model (LLM) rather than the input image. Our empirical
experiments underscore the persistence of this bias, as LVLMs often provide
confident answers even in the absence of relevant images or given incongruent
visual input. To rectify these biases and redirect the model's focus toward
vision information, we introduce two simple, training-free strategies. Firstly,
for tasks such as classification or multi-choice question-answering (QA), we
propose a ``calibration'' step through affine transformation to adjust the
output distribution. This ``Post-Hoc debias'' approach ensures uniform scores
for each answer when the image is absent, serving as an effective
regularization technique to alleviate the influence of LLM priors. For more
intricate open-ended generation tasks, we extend this method to ``Debias
sampling'', drawing inspiration from contrastive decoding methods.
Furthermore, our investigation sheds light on the instability of LVLMs across
various decoding configurations. Through systematic exploration of different
settings, we significantly enhance performance, surpassing reported results and
raising concerns about the fairness of existing evaluations. Comprehensive
experiments substantiate the effectiveness of our proposed strategies in
mitigating biases. These strategies not only prove beneficial in minimizing
hallucinations but also contribute to the generation of more helpful and
precise illustrations.
comment: 38 pages, 17 figures
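The post-hoc calibration idea in the abstract above, making answer scores uniform when the image is absent, can be sketched with one simple choice of affine map: divide each answer probability by its image-free counterpart and renormalize. This particular map, the function name `debias`, and the toy probabilities are assumptions for illustration; the paper's exact transformation may differ.

```python
def debias(p_with_image, p_without_image):
    """Rescale answer probabilities by the image-free prior, then renormalize."""
    scaled = [p / q for p, q in zip(p_with_image, p_without_image)]
    total = sum(scaled)
    return [s / total for s in scaled]

# Hypothetical multiple-choice scores (assumption): the language prior favors
# answer 0 even without an image.
p_none = [0.7, 0.2, 0.1]   # answer probabilities with the image removed
p_img = [0.6, 0.3, 0.1]    # answer probabilities with the image present

calibrated = debias(p_img, p_none)
uniform_check = debias(p_none, p_none)
raw_choice = max(range(3), key=lambda i: p_img[i])
debiased_choice = max(range(3), key=lambda i: calibrated[i])
print(calibrated, raw_choice, debiased_choice)
```

In this toy case the raw scores pick answer 0, while the calibrated scores pick answer 1, the option whose probability rose most once the image was shown.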
♻ ☆ SIGNeRF: Scene Integrated Generation for Neural Radiance Fields
Advances in image diffusion models have recently led to notable improvements
in the generation of high-quality images. In combination with Neural Radiance
Fields (NeRFs), they enabled new opportunities in 3D generation. However, most
generative 3D approaches are object-centric and applying them to editing
existing photorealistic scenes is not trivial. We propose SIGNeRF, a novel
approach for fast and controllable NeRF scene editing and scene-integrated
object generation. A new generative update strategy ensures 3D consistency
across the edited images, without requiring iterative optimization. We find
that depth-conditioned diffusion models inherently possess the capability to
generate 3D consistent views by requesting a grid of images instead of single
views. Based on these insights, we introduce a multi-view reference sheet of
modified images. Our method updates an image collection consistently based on
the reference sheet and refines the original NeRF with the newly generated
image set in one go. By exploiting the depth conditioning mechanism of the
image diffusion model, we gain fine control over the spatial location of the
edit and enforce shape guidance by a selected region or an external mesh.
comment: Project Page: https://signerf.jdihlmann.com
♻ ☆ LocalStyleFool: Regional Video Style Transfer Attack Using Segment Anything Model SP
Previous work has shown that well-crafted adversarial perturbations can
threaten the security of video recognition systems. Attackers can invade such
models with a low query budget when the perturbations are semantic-invariant,
such as StyleFool. Despite the query efficiency, the naturalness of the minutia
areas still requires amelioration, since StyleFool leverages style transfer to
all pixels in each frame. To close the gap, we propose LocalStyleFool, an
improved black-box video adversarial attack that superimposes regional
style-transfer-based perturbations on videos. Benefiting from the popularity
and scalable usability of the Segment Anything Model (SAM), we first extract
different regions according to semantic information and then track them through
the video stream to maintain the temporal consistency. Then, we add
style-transfer-based perturbations to several regions selected based on the
associative criterion of transfer-based gradient information and regional area.
Fine adjustment of the perturbations then makes the stylized videos adversarial.
We demonstrate that LocalStyleFool can improve both intra-frame and inter-frame
naturalness through a human-assessed survey, while maintaining competitive
fooling rate and query efficiency. Successful experiments on the
high-resolution dataset also showcase that scrupulous segmentation of SAM helps
to improve the scalability of adversarial attacks under high-resolution data.
comment: Accepted to 2024 IEEE Security and Privacy Workshops (SPW)
♻ ☆ TULIP: Transformer for Upsampling of LiDAR Point Cloud CVPR 2024
LiDAR Upsampling is a challenging task for the perception systems of robots
and autonomous vehicles, due to the sparse and irregular structure of
large-scale scene contexts. Recent works propose to solve this problem by
converting LiDAR data from 3D Euclidean space into an image super-resolution
problem in 2D image space. Although their methods can generate high-resolution
range images with fine-grained details, the resulting 3D point clouds often
blur out details and predict invalid points. In this paper, we propose TULIP, a
new method to reconstruct high-resolution LiDAR point clouds from
low-resolution LiDAR input. We also follow a range image-based approach but
specifically modify the patch and window geometries of a Swin-Transformer-based
network to better fit the characteristics of range images. We conducted several
experiments on three public real-world and simulated datasets. TULIP
outperforms state-of-the-art methods in all relevant metrics and generates
robust and more realistic point clouds than prior works.
comment: The paper was accepted by CVPR 2024
♻ ☆ 3D Face Reconstruction Using A Spectral-Based Graph Convolution Encoder WWW 2024
Monocular 3D face reconstruction plays a crucial role in avatar generation,
with significant demand in web-related applications such as generating virtual
financial advisors in FinTech. Current reconstruction methods predominantly
rely on deep learning techniques and employ 2D self-supervision as a means to
guide model learning. However, these methods encounter challenges in capturing
the comprehensive 3D structural information of the face due to the utilization
of 2D images for model training purposes. To overcome this limitation and
enhance the reconstruction of 3D structural features, we propose an innovative
approach that integrates existing 2D features with 3D features to guide the
model learning process. Specifically, we introduce the 3D-ID Loss, which
leverages the high-dimensional structure features extracted from a
Spectral-Based Graph Convolution Encoder applied to the facial mesh. This
approach surpasses the sole reliance on the 3D information provided by the
facial mesh vertex coordinates. Our model is trained using 2D-3D data pairs
from a combination of datasets and achieves state-of-the-art performance on the
NoW benchmark.
comment: 4 pages, 3 figures. Accepted to WWW 2024
♻ ☆ AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error CVPR 2024
With recent text-to-image models, anyone can generate deceptively realistic
images with arbitrary contents, fueling the growing threat of visual
disinformation. A key enabler for generating high-resolution images with low
computational cost has been the development of latent diffusion models (LDMs).
In contrast to conventional diffusion models, LDMs perform the denoising
process in the low-dimensional latent space of a pre-trained autoencoder (AE)
instead of the high-dimensional image space. Despite their relevance, the
forensic analysis of LDMs is still in its infancy. In this work we propose
AEROBLADE, a novel detection method which exploits an inherent component of
LDMs: the AE used to transform images between image and latent space. We find
that generated images can be more accurately reconstructed by the AE than real
images, allowing for a simple detection approach based on the reconstruction
error. Most importantly, our method is easy to implement and does not require
any training, yet nearly matches the performance of detectors that rely on
extensive training. We empirically demonstrate that AEROBLADE is effective
against state-of-the-art LDMs, including Stable Diffusion and Midjourney.
Beyond detection, our approach allows for the qualitative analysis of images,
which can be leveraged for identifying inpainted regions. We release our code
and data at https://github.com/jonasricker/aeroblade .
comment: Accepted to CVPR 2024
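As a rough illustration of the reconstruction-error idea in the abstract above,
the sketch below scores an image by how well an autoencoder reproduces it. The
`toy_autoencoder` (a block-average down/upsampler), the mean-squared-error
distance, and the threshold value are all assumptions made for illustration;
they are not taken from the paper or its repository.

```python
import numpy as np

def toy_autoencoder(image, factor=4):
    """Hypothetical stand-in for an LDM autoencoder: block-average, then upsample."""
    h, w = image.shape
    blocks = image.reshape(h // factor, factor, w // factor, factor)
    latent = blocks.mean(axis=(1, 3))  # "encode" to a low-resolution latent
    # "decode" by repeating every latent value over its block
    return np.repeat(np.repeat(latent, factor, axis=0), factor, axis=1)

def reconstruction_error(image, autoencoder=toy_autoencoder):
    """Mean squared error between an image and its reconstruction (assumed metric)."""
    recon = autoencoder(image)
    return float(np.mean((image - recon) ** 2))

def looks_generated(image, threshold, autoencoder=toy_autoencoder):
    """Flag images that the autoencoder reconstructs unusually well."""
    return reconstruction_error(image, autoencoder) < threshold

rng = np.random.default_rng(0)
# An image the toy autoencoder can represent: constant inside each 4x4 block.
easy_image = np.repeat(np.repeat(rng.random((8, 8)), 4, axis=0), 4, axis=1)
# An image full of detail that the toy autoencoder cannot represent.
hard_image = rng.random((32, 32))

print(reconstruction_error(easy_image), reconstruction_error(hard_image))
print(looks_generated(easy_image, threshold=1e-3),
      looks_generated(hard_image, threshold=1e-3))
```

The detector needs no training: only a reconstruction pass and a threshold,
which in practice would have to be calibrated on held-out data.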
♻ ☆ A citizen science toolkit to collect human perceptions of urban environments using open street view images
Street View-level Imagery (SVI) is a valuable data source for studies (e.g.,
environmental assessments, green space identification or land cover
classification). While commercial SVI is available, such providers commonly
restrict copying or reuse in ways necessary for research. Open SVI datasets are
readily available from less restrictive sources, such as Mapillary, but due to
the heterogeneity of the images, these require substantial preprocessing,
filtering, and careful quality checks. We present an efficient method for
automated downloading, processing, cropping, and filtering open SVI, to be used
in a survey of human perceptions of the streets portrayed in these images. We
demonstrate our open-source reusable SVI preparation and smartphone-friendly
perception-survey software with Amsterdam (Netherlands) as the case study.
Using a citizen science approach, we collected 22,637 ratings from 331 people
about their perceptions for various criteria. We have published our software in
a public repository for future re-use and reproducibility.
♻ ☆ Scalable Non-Cartesian Magnetic Resonance Imaging with R2D2
We propose a new approach for non-Cartesian magnetic resonance image
reconstruction. While unrolled architectures provide robustness via
data-consistency layers, embedding measurement operators in a Deep Neural Network
(DNN) can become impractical at large scale. Alternative Plug-and-Play (PnP)
approaches, where the denoising DNNs are blind to the measurement setting, are
not affected by this limitation and have also proven effective, but their
highly iterative nature also affects scalability. To address this scalability
challenge, we leverage the "Residual-to-Residual DNN series for high-Dynamic
range imaging (R2D2)" approach recently introduced in astronomical imaging.
R2D2's reconstruction is formed as a series of residual images, iteratively
estimated as outputs of DNNs taking the previous iteration's image estimate and
associated data residual as inputs. The method can be interpreted as a learned
version of the Matching Pursuit algorithm. We demonstrate R2D2 in simulation,
considering radial k-space sampling acquisition sequences. Our preliminary
results suggest that R2D2 achieves: (i) suboptimal performance compared to its
unrolled incarnation R2D2-Net, which is however non-scalable due to the
necessary embedding of NUFFT-based data-consistency layers; (ii) superior
reconstruction quality to a scalable version of R2D2-Net embedding an FFT-based
approximation for data consistency; (iii) superior reconstruction quality to
PnP, while only requiring few iterations.
comment: submitted to IEEE EUSIPCO 2024
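The series structure described above (each stage adds a residual image computed
from the previous estimate and its data residual) can be sketched as follows.
This is a minimal sketch under strong assumptions: the measurement operator `A`
is a small random matrix rather than a NUFFT, and `make_stand_in_network` is a
hypothetical placeholder that returns a scaled data residual in place of a
trained DNN, which turns the loop into plain gradient descent on a
least-squares objective.

```python
import numpy as np

def data_residual(A, y, x):
    """Back-projected data residual A^T (y - A x) for the current estimate x."""
    return A.T @ (y - A @ x)

def make_stand_in_network(step):
    # Hypothetical placeholder for a trained DNN: it ignores the image
    # estimate and returns a scaled copy of the data residual.
    def network(x_prev, r_prev):
        return step * r_prev
    return network

def residual_series_reconstruction(A, y, networks):
    """Form the reconstruction as a sum of residual images, one per network."""
    x = np.zeros(A.shape[1])
    for net in networks:
        r = data_residual(A, y, x)
        x = x + net(x, r)  # add the next residual image to the estimate
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))        # assumed toy measurement operator
x_true = rng.standard_normal(10)
y = A @ x_true                           # noiseless measurements
step = 1.0 / np.linalg.norm(A, 2) ** 2   # safe step size for the stand-in
networks = [make_stand_in_network(step) for _ in range(200)]
x_hat = residual_series_reconstruction(A, y, networks)
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```

With trained networks in place of the stand-in, far fewer terms in the series
would be expected, which is the scalability argument the abstract makes.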
♻ ☆ FoMo-Bench: a multi-modal, multi-scale and multi-task Forest Monitoring Benchmark for remote sensing foundation models
Forests are an essential part of Earth's ecosystems and natural systems, as
well as providing services on which humanity depends, yet they are rapidly
changing as a result of land use decisions and climate change. Understanding
and mitigating negative effects requires parsing data on forests at global
scale from a broad array of sensory modalities, and recently many such problems
have been approached using machine learning algorithms for remote sensing. To
date, forest-monitoring problems have largely been addressed in isolation.
Inspired by the rise of foundation models for computer vision and remote
sensing, we here present the first unified Forest Monitoring Benchmark
(FoMo-Bench). FoMo-Bench consists of 15 diverse datasets encompassing
satellite, aerial, and inventory data, covering a variety of geographical
regions, and including multispectral, red-green-blue, synthetic aperture radar
(SAR) and LiDAR data with various temporal, spatial and spectral resolutions.
FoMo-Bench includes multiple types of forest-monitoring tasks, spanning
classification, segmentation, and object detection. To further enhance the
diversity of tasks and geographies represented in FoMo-Bench, we introduce a
novel global dataset, TalloS, combining satellite imagery with ground-based
annotations for tree species classification, encompassing 1,000+ categories
across multiple hierarchical taxonomic levels (species, genus, family).
Finally, we propose FoMo-Net, a baseline foundation model with the capacity to
process any combination of commonly used spectral bands in remote sensing,
across diverse ground sampling distances and geographical locations worldwide.
This work aims to inspire research collaborations between machine learning and
forest biology researchers in exploring scalable multi-modal and multi-task
models for forest monitoring. All code and data will be made publicly
available.
comment: 26 pages
♻ ☆ Retrieval-Augmented Generation for AI-Generated Content: A Survey
Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Bin Cui
The development of Artificial Intelligence Generated Content (AIGC) has been
facilitated by advancements in model algorithms, the increasing scale of
foundation models, and the availability of ample high-quality datasets. While
AIGC has achieved remarkable performance, it still faces several challenges,
such as the difficulty of maintaining up-to-date and long-tail knowledge, the
risk of data leakage, and the high costs associated with training and
inference. Retrieval-Augmented Generation (RAG) has recently emerged as a
paradigm to address such challenges. In particular, RAG introduces the
information retrieval process, which enhances the generation process by
retrieving relevant objects from available data stores, leading to higher
accuracy and better robustness. In this paper, we comprehensively review
existing efforts that integrate RAG techniques into AIGC scenarios. We first
classify RAG foundations according to how the retriever augments the generator,
distilling the fundamental abstractions of the augmentation methodologies for
various retrievers and generators. This unified perspective encompasses all RAG
scenarios, illuminating advancements and pivotal technologies that help with
potential future progress. We also summarize additional enhancement methods
for RAG, facilitating effective engineering and implementation of RAG systems.
Then, from another view, we survey practical applications of RAG across
different modalities and tasks, offering valuable references for researchers
and practitioners. Furthermore, we introduce the benchmarks for RAG, discuss
the limitations of current RAG systems, and suggest potential directions for
future research. Project Repo: https://github.com/hymie122/RAG-Survey.
comment: Citing 380 papers, 36 pages, 16 figures. Project:
https://github.com/hymie122/RAG-Survey
♻ ☆ Learning Concept-Based Causal Transition and Symbolic Reasoning for Visual Planning
Visual planning simulates how humans make decisions to achieve desired goals
in the form of searching for visual causal transitions between an initial
visual state and a final visual goal state. It has become increasingly
important in egocentric vision with its advantages in guiding agents to perform
daily tasks in complex environments. In this paper, we propose an interpretable
and generalizable visual planning framework consisting of i) a novel
Substitution-based Concept Learner (SCL) that abstracts visual inputs into
disentangled concept representations, ii) symbol abstraction and reasoning that
performs task planning via the self-learned symbols, and iii) a Visual Causal
Transition model (ViCT) that grounds visual causal transitions to semantically
similar real-world actions. Given an initial state, we perform goal-conditioned
visual planning with a symbolic reasoning method fueled by the learned
representations and causal transitions to reach the goal state. To verify the
effectiveness of the proposed model, we collect a large-scale visual planning
dataset based on AI2-THOR, dubbed CCTP. Extensive experiments on this
challenging dataset demonstrate the superior performance of our method in
visual task planning. Empirically, we show that our framework can generalize to
unseen task trajectories, unseen object categories, and real-world data.
Further details of this work are provided at
https://fqyqc.github.io/ConTranPlan/.
♻ ☆ Centered Masking for Language-Image Pre-Training
We introduce Gaussian masking for Language-Image Pre-Training (GLIP), a novel,
straightforward, and effective technique for masking image patches during
pre-training of a vision-language model. GLIP builds on Fast Language-Image
Pre-Training (FLIP), which randomly masks image patches while training a CLIP
model. GLIP replaces random masking with centered masking, which uses a Gaussian
distribution and is inspired by the importance of image patches at the center
of the image. GLIP retains the same computational savings as FLIP, while
improving performance across a range of downstream datasets and tasks, as
demonstrated by our experimental results. We show the benefits of GLIP to be
easy to obtain, requiring no delicate tuning of the Gaussian, and also
applicable to data sets containing images without an obvious center focus.
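One way to picture centered masking is to sample the patches that stay visible
with probabilities that follow a Gaussian around the image center. The sketch
below is an illustration only: the function name, the `sigma` parametrisation
(a fraction of the grid size), and the sampling-without-replacement scheme are
assumptions, not the authors' implementation.

```python
import numpy as np

def centered_keep_mask(grid_h, grid_w, keep_ratio, sigma, rng):
    """Sample patches to keep, favouring the image centre.

    sigma is expressed as a fraction of the grid size (an assumption).
    Returns a boolean (grid_h, grid_w) array where True means "patch kept".
    """
    # Normalised offsets of every patch from the centre of the grid.
    ys = (np.arange(grid_h) - (grid_h - 1) / 2) / grid_h
    xs = (np.arange(grid_w) - (grid_w - 1) / 2) / grid_w
    dist2 = ys[:, None] ** 2 + xs[None, :] ** 2
    # Gaussian weights: central patches are more likely to be kept.
    weights = np.exp(-dist2 / (2 * sigma ** 2)).ravel()
    probs = weights / weights.sum()
    n_keep = int(round(keep_ratio * grid_h * grid_w))
    kept = rng.choice(grid_h * grid_w, size=n_keep, replace=False, p=probs)
    mask = np.zeros(grid_h * grid_w, dtype=bool)
    mask[kept] = True
    return mask.reshape(grid_h, grid_w)

rng = np.random.default_rng(0)
mask = centered_keep_mask(14, 14, keep_ratio=0.5, sigma=0.25, rng=rng)
print(mask.astype(int))
```

The number of kept patches is fixed by `keep_ratio`, so the compute saved per
image is the same as with uniform random masking at the same ratio.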
♻ ☆ Physical 3D Adversarial Attacks against Monocular Depth Estimation in Autonomous Driving CVPR 2024
Deep learning-based monocular depth estimation (MDE), extensively applied in
autonomous driving, is known to be vulnerable to adversarial attacks. Previous
physical attacks against MDE models rely on 2D adversarial patches, so they
only affect a small, localized region in the MDE map but fail under various
viewpoints. To address these limitations, we propose 3D Depth Fool
(3D$^2$Fool), the first 3D texture-based adversarial attack against MDE models.
3D$^2$Fool is specifically optimized to generate 3D adversarial textures
agnostic to model types of vehicles and to have improved robustness in bad
weather conditions, such as rain and fog. Experimental results validate the
superior performance of our 3D$^2$Fool across various scenarios, including
vehicles, MDE models, weather conditions, and viewpoints. Real-world
experiments with printed 3D textures on physical vehicle models further
demonstrate that our 3D$^2$Fool can cause an MDE error of over 10 meters.
comment: Accepted by CVPR 2024
♻ ☆ Weakly-Supervised Conditional Embedding for Referred Visual Search
This paper introduces a new challenge for image similarity search in the
context of fashion, addressing the inherent ambiguity in this domain stemming
from complex images. We present Referred Visual Search (RVS), a task allowing
users to define more precisely the desired similarity, following recent
interest in the industry. We release a new large public dataset,
LAION-RVS-Fashion, consisting of 272k fashion products with 842k images
extracted from LAION, designed explicitly for this task. However, unlike
traditional visual search methods in the industry, we demonstrate that superior
performance can be achieved by bypassing explicit object detection and adopting
weakly-supervised conditional contrastive learning on image tuples. Our method
is lightweight and demonstrates robustness, reaching Recall at one superior to
strong detection-based baselines against 2M distractors. Code, data and models
are available at https://www.github.com/Simon-Lepage/CondViT-LRVSF .
comment: 28 pages, 13 figures, 5 tables
♻ ☆ Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers CVPR
Vision Transformer (ViT) has emerged as a prominent backbone for computer
vision. For more efficient ViTs, recent works lessen the quadratic cost of the
self-attention layer by pruning or fusing redundant tokens. However, these
works face a speed-accuracy trade-off caused by the loss of information.
Here, we argue that token fusion needs to consider diverse relations between
tokens to minimize information loss. In this paper, we propose Multi-criteria
Token Fusion (MCTF), which gradually fuses tokens based on multiple criteria
(e.g., similarity, informativeness, and size of fused tokens). Further, we
utilize one-step-ahead attention, an improved approach to capturing
the informativeness of the tokens. By training the model equipped with MCTF
using token reduction consistency, we achieve the best speed-accuracy
trade-off in image classification (ImageNet1K). Experimental results show
that MCTF consistently surpasses the previous reduction methods with and
without training. Specifically, DeiT-T and DeiT-S with MCTF reduce FLOPs by
about 44% while improving the performance (+0.5%, and +0.3%) over the base
model, respectively. We also demonstrate the applicability of MCTF in various
Vision Transformers (e.g., T2T-ViT, LV-ViT), achieving at least 31% speedup
without performance degradation. Code is available at
https://github.com/mlvlab/MCTF.
comment: Conference on Computer Vision and Pattern Recognition (CVPR), 2024
♻ ☆ Task-Adaptive Saliency Guidance for Exemplar-free Class Incremental Learning CVPR 2024
Exemplar-free Class Incremental Learning (EFCIL) aims to sequentially learn
tasks with access only to data from the current one. EFCIL is of interest
because it mitigates concerns about privacy and long-term storage of data,
while at the same time alleviating the problem of catastrophic forgetting in
incremental learning. In this work, we introduce task-adaptive saliency for
EFCIL and propose a new framework, which we call Task-Adaptive Saliency
Supervision (TASS), for mitigating the negative effects of saliency drift
between different tasks. We first apply boundary-guided saliency to maintain
task adaptivity and plasticity on model attention. Besides, we
introduce task-agnostic low-level signals as auxiliary supervision to increase
the stability of model attention. Finally, we introduce a module for
injecting and recovering saliency noise to increase the robustness of saliency
preservation. Our experiments demonstrate that our method can better preserve
saliency maps across tasks and achieve state-of-the-art results on the
CIFAR-100, Tiny-ImageNet, and ImageNet-Subset EFCIL benchmarks. Code is
available at https://github.com/scok30/tass.
comment: Accepted at CVPR 2024
♻ ☆ The Effects of Mixed Sample Data Augmentation are Class Dependent
Mixed Sample Data Augmentation (MSDA) techniques, such as Mixup, CutMix, and
PuzzleMix, have been widely acknowledged for enhancing performance in a variety
of tasks. A previous study reported the class dependency of traditional data
augmentation (DA), where certain classes benefit disproportionately compared to
others. This paper reveals a class-dependent effect of MSDA, where some classes
experience improved performance while others experience degraded performance.
This research addresses the issue of class dependency in MSDA and proposes an
algorithm to mitigate it. The approach involves training on a mixture of MSDA
and non-MSDA data, which not only mitigates the negative impact on the affected
classes, but also improves overall accuracy. Furthermore, we provide in-depth
analysis and discussion of why MSDA introduces class dependencies and which
classes are most likely to have them.
comment: 21 pages, 18 figures, Overall Revision
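To make "training on a mixture of MSDA and non-MSDA data" concrete, the sketch
below applies Mixup to a batch with some probability and otherwise leaves the
batch untouched. The per-batch coin flip and the names `maybe_mixup`, `p_msda`,
and `alpha` are assumptions for illustration; the abstract does not specify how
the two kinds of data are combined.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha, rng):
    """Standard Mixup: convex combination of a batch with a shuffled copy."""
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mixed, y_mixed

def maybe_mixup(x, y_onehot, p_msda, alpha, rng):
    """Apply Mixup with probability p_msda, else return the batch unchanged."""
    if rng.random() < p_msda:
        return mixup_batch(x, y_onehot, alpha, rng)
    return x, y_onehot

rng = np.random.default_rng(0)
x = rng.random((8, 3, 4, 4))                   # toy batch of 8 images
y = np.eye(5)[rng.integers(0, 5, size=8)]      # one-hot labels, 5 classes
x_out, y_out = maybe_mixup(x, y, p_msda=0.5, alpha=1.0, rng=rng)
print(x_out.shape, y_out.sum(axis=1))
```

Setting `p_msda` to 1.0 recovers plain Mixup training and 0.0 recovers training
without MSDA, so the parameter interpolates between the two regimes.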
♻ ☆ Spectral Meets Spatial: Harmonising 3D Shape Matching and Interpolation CVPR2024
Although 3D shape matching and interpolation are highly interrelated, they
are often studied separately and applied sequentially to relate different 3D
shapes, thus resulting in sub-optimal performance. In this work we present a
unified framework to predict both point-wise correspondences and shape
interpolation between 3D shapes. To this end, we combine the deep functional
map framework with classical surface deformation models to map shapes in both
spectral and spatial domains. On the one hand, by incorporating spatial maps,
our method obtains more accurate and smooth point-wise correspondences compared
to previous functional map methods for shape matching. On the other hand, by
introducing spectral maps, our method gets rid of commonly used but
computationally expensive geodesic distance constraints that are only valid for
near-isometric shape deformations. Furthermore, we propose a novel test-time
adaptation scheme to capture both pose-dominant and shape-dominant
deformations. Using different challenging datasets, we demonstrate that our
method outperforms previous state-of-the-art methods for both shape matching
and interpolation, even compared to supervised approaches.
comment: accepted by CVPR2024
♻ ☆ CEIMVEN: An Approach of Cutting Edge Implementation of Modified Versions of EfficientNet (V1-V2) Architecture for Breast Cancer Detection and Classification from Ultrasound Images
Breast cancer is one of the most widespread and feared cancers across the
globe. Millions of women are affected by it each year, and it remains a major
cause of death among women. In recent research, Medical
Image Computing and Processing has played a significant role in
detecting and classifying breast cancers from ultrasound images and mammograms,
aided by deep neural networks. In this research, we
focus on our implementations and iterative result analysis of
different modified versions of the EfficientNet architectures, namely
EfficientNet-V1 (b0-b7) and EfficientNet-V2 (b0-b3), on ultrasound images,
an approach we name CEIMVEN. We use transfer learning with the
pre-trained models of the EfficientNet versions. We applied hyper-parameter
tuning, added fully connected layers, discarded
outliers, and recorded the accuracy results from our custom modified
EfficientNet architectures. Our deep learning training approach
covers both identifying the cancer-affected areas with region of interest
(ROI) techniques and multi-class classification (benign, malignant and normal).
The approximate testing accuracies obtained from the modified versions of
EfficientNet-V1 (b0: 99.15%, b1: 98.58%, b2: 98.43%, b3: 98.01%, b4: 98.86%,
b5: 97.72%, b6: 97.72%, b7: 98.72%) and EfficientNet-V2 (b0: 99.29%, b1:
99.01%, b2: 98.72%, b3: 99.43%) indicate the strong
potential of deep learning approaches for the successful detection and
classification of breast cancers from ultrasound images at a very early
stage. The code for this research is available here:
https://github.com/ac005sheekar/CEIMVEN-Breast.
♻ ☆ ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions CVPR2024
Although Vision Transformer (ViT) has achieved significant success in
computer vision, it does not perform well in dense prediction tasks due to the
lack of inner-patch information interaction and the limited diversity of
feature scale. Most existing studies are devoted to designing vision-specific
transformers to solve the above problems, which introduce additional
pre-training costs. Therefore, we present a plain, pre-training-free, and
feature-enhanced ViT backbone with Convolutional Multi-scale feature
interaction, named ViT-CoMer, which facilitates bidirectional interaction
between CNN and transformer. Compared to the state-of-the-art, ViT-CoMer has
the following advantages: (1) We inject spatial pyramid multi-receptive field
convolutional features into the ViT architecture, which effectively alleviates
the problems of limited local information interaction and single-feature
representation in ViT. (2) We propose a simple and efficient CNN-Transformer
bidirectional fusion interaction module that performs multi-scale fusion across
hierarchical features, which is beneficial for handling dense prediction tasks.
(3) We evaluate the performance of ViT-CoMer across various dense prediction
tasks, different frameworks, and multiple advanced pre-training schemes. Notably, our
ViT-CoMer-L achieves 64.3% AP on COCO val2017 without extra training data, and
62.1% mIoU on ADE20K val, both of which are comparable to state-of-the-art
methods. We hope ViT-CoMer can serve as a new backbone for dense prediction
tasks to facilitate future research. The code will be released at
https://github.com/Traffic-X/ViT-CoMer.
comment: CVPR2024
♻ ☆ InterControl: Generate Human Motion Interactions by Controlling Every Joint
Text-conditioned human motion synthesis has made remarkable progress with the
emergence of diffusion models in recent research. However, the majority of
these motion diffusion models are primarily designed for a single character and
overlook multi-human interactions. In our approach, we strive to explore this
problem by synthesizing human motion with interactions for a group of
characters of any size. The key aspect of our approach is the adaptation of
human-wise interactions as pairs of human joints that can be either in contact
or separated by a desired distance. In contrast to existing methods that
necessitate training motion generation models on multi-human motion datasets
with a fixed number of characters, our approach inherently possesses the
flexibility to model human interactions involving an arbitrary number of
individuals, thereby transcending the limitations imposed by the training data.
We introduce a novel controllable motion generation method, InterControl, to
encourage the synthesized motions to maintain the desired distance between
joint pairs. It consists of a motion controller and an inverse kinematics
guidance module that realistically and accurately aligns the joints of
synthesized characters to the desired location. Furthermore, we demonstrate
that the distance between joint pairs for human-wise interactions can be
generated using an off-the-shelf Large Language Model (LLM). Experimental
results highlight the capability of our framework to generate interactions with
multiple human characters and its potential to work with off-the-shelf
physics-based character simulators.
comment: Generate human interactions with only single-person data via joint
contact pairs, code https://github.com/zhenzhiwang/intercontrol
♻ ☆ SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces ICLR 2024
Given the remarkable achievements in image generation through diffusion
models, the research community has shown increasing interest in extending these
models to video generation. Recent diffusion models for video generation have
predominantly utilized attention layers to extract temporal features. However,
attention layers are limited by their memory consumption, which increases
quadratically with the length of the sequence. This limitation presents
significant challenges when attempting to generate longer video sequences using
diffusion models. To overcome this challenge, we propose leveraging state-space
models (SSMs). SSMs have recently gained attention as viable alternatives due
to their linear memory consumption relative to sequence length. In the
experiments, we first evaluate our SSM-based model with UCF101, a standard
benchmark of video generation. In addition, to investigate the potential of
SSMs for longer video generation, we perform an experiment using the MineRL
Navigate dataset, varying the number of frames to 64, 200, and 400. In these
settings, our SSM-based model considerably reduces memory consumption for
longer sequences, while maintaining FVD scores competitive with the
attention-based models. Our code is available at
https://github.com/shim0114/SSM-Meets-Video-Diffusion-Models.
comment: Accepted as workshop paper at ICLR 2024
♻ ☆ Rotation-Invariant Transformer for Point Cloud Matching CVPR 2023
The intrinsic rotation invariance lies at the core of matching point clouds
with handcrafted descriptors. However, it is widely disregarded by recent deep
matchers that obtain the rotation invariance extrinsically via data
augmentation. As the finite number of augmented rotations can never span the
continuous SO(3) space, these methods usually show instability when facing
rotations that are rarely seen. To this end, we introduce RoITr, a
Rotation-Invariant Transformer to cope with the pose variations in the point
cloud matching task. We contribute both on the local and global levels.
Starting from the local level, we introduce an attention mechanism embedded
with Point Pair Feature (PPF)-based coordinates to describe the pose-invariant
geometry, upon which a novel attention-based encoder-decoder architecture is
constructed. We further propose a global transformer with rotation-invariant
cross-frame spatial awareness learned by the self-attention mechanism, which
significantly improves the feature distinctiveness and makes the model robust
with respect to low overlap. Experiments are conducted on both rigid
and non-rigid public benchmarks, where RoITr outperforms all the
state-of-the-art models by a considerable margin in the low-overlapping
scenarios. Especially when the rotations are enlarged on the challenging
3DLoMatch benchmark, RoITr surpasses the existing methods by at least 13 and 5
percentage points in terms of Inlier Ratio and Registration Recall,
respectively.
comment: Accepted to CVPR 2023
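The rotation invariance that Point Pair Features provide can be checked
numerically. The sketch below uses the classic four-dimensional PPF (distance
plus three angles) for two oriented points and verifies that it is unchanged by
a random rigid motion. It illustrates the underlying geometric idea only; how
RoITr embeds PPF-based coordinates into attention is not reproduced here, and
the helper names are chosen for this example.

```python
import numpy as np

def angle(u, v):
    """Angle in radians between two 3D vectors."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def point_pair_feature(p1, n1, p2, n2):
    """Classic PPF: (|d|, angle(n1, d), angle(n2, d), angle(n1, n2)), d = p2 - p1."""
    d = p2 - p1
    return np.array([np.linalg.norm(d), angle(n1, d), angle(n2, d), angle(n1, n2)])

def random_rotation(rng):
    """Random 3x3 rotation matrix from the QR factorisation of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))      # fix column signs
    if np.linalg.det(q) < 0:         # ensure a proper rotation (det = +1)
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(0)
p1, p2 = rng.standard_normal(3), rng.standard_normal(3)   # point positions
n1, n2 = rng.standard_normal(3), rng.standard_normal(3)   # point normals
R, t = random_rotation(rng), rng.standard_normal(3)       # random rigid motion

before = point_pair_feature(p1, n1, p2, n2)
after = point_pair_feature(R @ p1 + t, R @ n1, R @ p2 + t, R @ n2)
print(before, after)
```

Because the feature depends only on lengths and angles, no rotation
augmentation is needed for it to stay constant under pose changes.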
♻ ☆ Extend Your Own Correspondences: Unsupervised Distant Point Cloud Registration by Progressive Distance Extension CVPR
Registration of point clouds collected from a pair of distant vehicles
provides a comprehensive and accurate 3D view of the driving scenario, which is
vital for driving-safety-related applications, yet existing literature suffers
from expensive pose label acquisition and the inability to generalize to
new data distributions. In this paper, we propose EYOC, an unsupervised distant
point cloud registration method that adapts to new point cloud distributions on
the fly, requiring no global pose labels. The core idea of EYOC is to train a
feature extractor in a progressive fashion, where in each round, the feature
extractor, trained with near point cloud pairs, can label slightly farther
point cloud pairs, enabling self-supervision on such far point cloud pairs.
This process continues until the derived extractor can be used to register
distant point clouds. Particularly, to enable high-fidelity correspondence
label generation, we devise an effective spatial filtering scheme to select the
most representative correspondences to register a point cloud pair, and then
utilize the aligned point clouds to discover more correct correspondences.
Experiments show that EYOC can achieve comparable performance with
state-of-the-art supervised methods at a lower training cost. Moreover, it
outperforms supervised methods regarding generalization performance on new data
distributions.
comment: In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2024
♻ ☆ Foundation Model Makes Clustering A Better Initialization For Cold-Start Active Learning
Active learning selects the most informative samples from the unlabelled
dataset to annotate in the context of a limited annotation budget. While
numerous methods have been proposed for subsequent sample selection based on an
initialized model, scant attention has been paid to the indispensable phase of
active learning: selecting samples for model cold-start initialization. Most of
the previous studies resort to random sampling or naive clustering. However,
random sampling is prone to fluctuation, and naive clustering suffers from
slow convergence, particularly when dealing with high-dimensional data such as
imaging data. In this work, we propose to integrate foundation models with
clustering methods to select samples for cold-start active learning
initialization. Foundation models refer to those trained on massive datasets by
the self-supervised paradigm and capable of generating informative and
compact embeddings for various downstream tasks. Leveraging these embeddings
to replace raw features such as pixel values, clustering quickly converges and
identifies better initial samples. For a comprehensive comparison, we included
a classic ImageNet-supervised model to acquire embeddings. Experiments on two
clinical tasks of image classification and segmentation demonstrated that
foundation model-based clustering efficiently pinpointed informative initial
samples, leading to models that outperform the baseline
methods. We envisage that this study provides an effective paradigm for future
cold-start active learning.
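A minimal sketch of the selection step described above, under stated
assumptions: the abstract does not specify the clustering algorithm or how a
sample is picked from each cluster, so this example assumes plain k-means with
one cluster per annotation-budget slot and picks the sample nearest to each
centroid. The names `kmeans` and `select_cold_start` and the toy embeddings are
hypothetical; in practice the embeddings would come from a foundation model.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means over a list of equal-length float vectors."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assign = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assign == assign:
            break  # assignments stopped changing, so we have converged
        assign = new_assign
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


def select_cold_start(embeddings, budget, seed=0):
    """Return indices of the samples closest to each cluster centroid.

    Assumption: one cluster per unit of annotation budget.
    """
    if not 1 <= budget <= len(embeddings):
        raise ValueError("budget must be between 1 and the number of samples")
    centroids, assign = kmeans(embeddings, budget, seed=seed)
    chosen = []
    for c, centre in enumerate(centroids):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            chosen.append(min(members, key=lambda i: math.dist(embeddings[i], centre)))
    return chosen


# Toy 2-D "embeddings" (hypothetical values standing in for model features).
toy = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
print(select_cold_start(toy, budget=2))
```

The selected indices would then be sent for annotation to train the initial
model. Swapping raw pixel vectors for embeddings changes only the input to
`select_cold_start`, which is the comparison the abstract describes.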
♻ ☆ DifFlow3D: Toward Robust Uncertainty-Aware Scene Flow Estimation with Iterative Diffusion-Based Refinement CVPR 2024
Jiuming Liu, Guangming Wang, Weicai Ye, Chaokang Jiang, Jinru Han, Zhe Liu, Guofeng Zhang, Dalong Du, Hesheng Wang
Scene flow estimation, which aims to predict per-point 3D displacements of
dynamic scenes, is a fundamental task in the computer vision field. However,
previous works commonly suffer from unreliable correlation caused by locally
constrained searching ranges, and struggle with accumulated inaccuracy arising
from the coarse-to-fine structure. To alleviate these problems, we propose a
novel uncertainty-aware scene flow estimation network (DifFlow3D) with the
diffusion probabilistic model. Iterative diffusion-based refinement is designed
to enhance correlation robustness and resilience to challenging cases, e.g.,
dynamics, noisy inputs, and repetitive patterns. To restrain the generation
diversity, three key flow-related features are leveraged as conditions in our
diffusion model. Furthermore, we also develop an uncertainty estimation module
within diffusion to evaluate the reliability of estimated scene flow. Our
DifFlow3D achieves state-of-the-art performance, reducing EPE3D by 24.0% and
29.1% on the FlyingThings3D and KITTI 2015 datasets, respectively. Notably, our
method achieves unprecedented millimeter-level accuracy (0.0078 m in EPE3D)
on the KITTI dataset. Additionally, our diffusion-based refinement paradigm can
be readily integrated as a plug-and-play module into existing scene flow
networks, significantly increasing their estimation accuracy. Codes are
released at https://github.com/IRMVLab/DifFlow3D.
comment: Camera-ready version of CVPR 2024. Codes are released at
https://github.com/IRMVLab/DifFlow3D
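The abstract reports results in EPE3D without defining it. As a sketch,
assuming the common definition (the mean Euclidean distance between predicted
and ground-truth per-point 3D flow vectors, in metres), the metric and a
relative reduction can be computed as follows. The function names are
illustrative and are not taken from the released code; evaluation protocols
that mask out some points are not modelled here.

```python
import math


def epe3d(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth 3D flows."""
    if not pred_flow or len(pred_flow) != len(gt_flow):
        raise ValueError("flows must be non-empty and of equal length")
    total = sum(math.dist(p, g) for p, g in zip(pred_flow, gt_flow))
    return total / len(pred_flow)


def relative_reduction(baseline_error, new_error):
    """Fractional error reduction of new_error relative to baseline_error."""
    return (baseline_error - new_error) / baseline_error


# Toy example with two points (made-up values, not results from the paper).
pred = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 3.0, 4.0)]
print(epe3d(pred, gt))               # per-point errors are 1 and 5
print(relative_reduction(2.0, 1.5))  # a drop from 2.0 to 1.5
```

Under this definition, a percentage "EPE3D reduction" is a relative drop in the
mean per-point error with respect to a baseline method.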
♻ ☆ Task-wise Sampling Convolutions for Arbitrary-Oriented Object Detection in Aerial Images
Arbitrary-oriented object detection (AOOD) has been widely applied to locate
and classify objects with diverse orientations in remote sensing images.
However, the inconsistent features for the localization and classification
tasks in AOOD models may lead to ambiguity and low-quality object predictions,
which constrains the detection performance. In this article, an AOOD method
called task-wise sampling convolutions (TS-Conv) is proposed. TS-Conv
adaptively samples task-wise features from respective sensitive regions and
maps these features together in alignment to guide a dynamic label assignment
for better predictions. Specifically, sampling positions of the localization
convolution in TS-Conv are supervised by the oriented bounding box (OBB)
prediction associated with spatial coordinates, while the sampling positions and
convolutional kernel of the classification convolution are designed to be
adaptively adjusted according to different orientations to improve the
orientation robustness of features. Furthermore, a dynamic
task-consistent-aware label assignment (DTLA) strategy is developed to select
optimal candidate positions and assign labels dynamically according to ranked
task-aware scores obtained from TS-Conv. Extensive experiments on several
public datasets covering multiple scenes, multimodal images, and multiple
categories of objects demonstrate the effectiveness, scalability, and superior
performance of the proposed TS-Conv.
comment: 15 pages, 13 figures, 11 tables
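For readers unfamiliar with oriented bounding boxes (OBBs), the sketch below
shows one common parameterization: centre, width, height, and rotation angle.
This is an assumption for illustration; the abstract does not state which OBB
representation or angle convention TS-Conv uses, and `obb_corners` is a
hypothetical helper, not part of the method.

```python
import math


def obb_corners(cx, cy, w, h, theta):
    """Return the four corners of an oriented box as (x, y) tuples.

    theta is a rotation in radians, counter-clockwise in a y-up frame
    (assumed convention; it appears clockwise in y-down image coordinates).
    """
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half_w, half_h = w / 2.0, h / 2.0
    # Corner offsets in the box's own axis-aligned frame.
    offsets = [(-half_w, -half_h), (half_w, -half_h),
               (half_w, half_h), (-half_w, half_h)]
    # Rotate each offset by theta, then translate to the box centre.
    return [
        (cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t)
        for dx, dy in offsets
    ]


# A 4 x 2 box centred at the origin, unrotated and rotated by 90 degrees.
print(obb_corners(0.0, 0.0, 4.0, 2.0, 0.0))
print(obb_corners(0.0, 0.0, 4.0, 2.0, math.pi / 2))
```

An axis-aligned box is the special case `theta = 0`, which is why detectors
for aerial images, where objects appear at arbitrary orientations, predict the
extra angle.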